Cloud Data Engineer

McLean, VA
March 01, 2023





Professional Summary

Data engineer with a programming background in Python, Scala, and Java who has transitioned into big data engineering. Has also worked on projects integrating robotics with the Unity game engine, and continues to deepen expertise in the operations, functions, and architecture of Hadoop.


9 years of experience in the development of custom Hadoop Big Data solutions, platforms, pipelines, data migration, and data visualizations.

Strong SQL skills, with development experience across relational database management systems including PostgreSQL and MS SQL Server.

Comfortable working on Linux (Ubuntu) as well as with Google Cloud Platform.

Created two GCP instances, one for development and one for production.

Experience working in the Hadoop ecosystem, understanding its architecture, and completing projects involving data transfer.

Researched and analyzed worldwide video game sales data.

Transformed the video game sales data in SQL and Microsoft Excel to make comparisons and separated the results across different RDBMSs.

Created classes that model real-world objects and wrote loops to perform actions on data.

AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)

Created HDFS directories and imported RDBMS tables into HDFS using the Sqoop tool.

Used Spark to strengthen Python programming skills and to understand Spark architecture.

Hands-on experience with Hive and HBase integration, with a clear understanding of how they differ.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Worked with other file formats such as Avro and ORC.

Hands-on experience with MongoDB and its differences from HBase and Hive.

Installed and used Airflow to orchestrate a Spark-based project.

Worked with Kafka to gather API data and display it in a terminal.

Excellent knowledge of Big Data infrastructure: distributed file systems (HDFS), parallel processing (the MapReduce framework), and the complete Hadoop ecosystem, including Hive, Hue, HBase, Zookeeper, Sqoop, Kafka, Storm, Spark, Flume, and Oozie.

In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization. Hands-on experience with YARN (MapReduce 2.0) architecture and components such as the Resource Manager, Node Manager, Container, and Application Master, and the execution of a MapReduce job.

Gained hands-on experience with NiFi and integrated it with the local terminal.

Ability to troubleshoot and tune code in SQL, Java, Python, Scala, and Hive, as well as Spark RDDs, DataFrames, and MapReduce jobs. Able to design elegant solutions from problem statements.

In-depth understanding of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and MapReduce concepts; experienced in writing MapReduce programs with Apache Hadoop to analyze large datasets efficiently.


IDE: Eclipse, IntelliJ, PyCharm, DBeaver

PROJECT METHODS: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking

HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS: Amazon AWS - EC2, MapR, Google Cloud Platform (GCP)


CLOUD DATABASE & TOOLS: Cassandra, Apache HBase, SQL, MongoDB


ETL TOOLS: Flume, Kafka, Sqoop


Data Query: Spark SQL, Data Frames

PROGRAMMING LANGUAGES & FRAMEWORKS: Spark, Spark Streaming, Java, Python, Scala, PySpark, PyTorch, C#, HTML

SCRIPTING: Hive, MapReduce, SQL, Spark SQL, Shell Scripting


Work History

Cloud Data Engineer Nov 2021 – Present

Freddie Mac McLean, VA

Assigned to a dev team that performed work based on Agile project development/delivery principles.

Participated in daily meetings held in Microsoft Teams, with Jira used to track work tasks/progress.

Worked in a cross-functional, distributed team environment.

Set up Big Data Analysis Platform (BDAP)/Snowflake data platform.

Ensured that metadata from vendors in Snowflake and BDAP had correct elements/attributes by conducting schema checks and working with the team to apply validation checks and cloud computing. The metadata arrived in different format versions: JSON, XML, and CSV.

Worked in AWS and Azure cloud environments.

Applied Anaconda to build secure and scalable Python data pipelines and machine learning workflows with the latest open-source innovation.

Used the Altova XMLSpy JSON/XML editor for modeling, editing, transforming, and debugging operations.

Worked in Spyder, an open-source cross-platform integrated development environment (IDE), programming in Python.

Reviewed Python code and made adjustments to fit test cases.

Displayed metadata via email and Excel spreadsheet.
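A schema check like the one described above can be sketched with the Python standard library. This is a minimal illustration only: the required field names below are hypothetical examples, not the actual vendor metadata schema.

```python
import json

# Hypothetical required attributes for a vendor metadata record
# (illustrative; the real element/attribute list comes from the vendor spec).
REQUIRED_FIELDS = {"dataset_name", "vendor_id", "ingest_date", "schema_version"}

def check_json_metadata(raw: str) -> list:
    """Return a sorted list of required fields missing from a JSON metadata record."""
    record = json.loads(raw)
    return sorted(REQUIRED_FIELDS - record.keys())

sample = '{"dataset_name": "rates", "vendor_id": "v42", "ingest_date": "2022-01-05"}'
print(check_json_metadata(sample))  # lists any fields absent from the record
```

The same pattern extends to XML (via `xml.etree.ElementTree`) and CSV headers, so one validator can cover all three metadata formats.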

Data Engineer Jun 2021 – Nov 2021

Riot Games West Los Angeles, CA

Transitioned pipelines from Databricks to Apache Airflow while editing/changing existing code. (Pipeline being transferred from Databricks to Airflow was called “Cloud Cost and Usage”.)

Set up WSL and Docker to work together with Ubuntu.

Set up Airflow via Docker.

Configured CI/CD pipeline through GitHub.

Performed work in alignment with an Agile project delivery methodology on a team of 8, participating in weekly meetings and using Jira to track each team member's tasks/progress.

Utilized the Riot Games VPN to gain access to internal tools.

Applied AWS to ingest and allocate data daily, weekly, and monthly.

Coded in Python and SQL.

Detected errors in Docker and found ways to improve the connection for future use.

Transitioned pipelines into Airflow in the form of DAGs.

Worked on the data lake on AWS S3 to connect it with different applications.

Utilized Great Expectations to validate data pipelines.

Used Python with the DataFrame and Spark SQL APIs for faster data processing.
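Moving a pipeline into Airflow means expressing it as a DAG: each task declares which tasks must finish first. The dependency idea can be sketched with the standard library's `graphlib`; the task names below are hypothetical stand-ins, not the actual "Cloud Cost and Usage" tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical task dependencies for a cost/usage pipeline expressed as a DAG:
# each key runs only after every task in its value set has finished.
deps = {
    "extract_usage": set(),
    "extract_costs": set(),
    "join_cost_usage": {"extract_usage", "extract_costs"},
    "load_warehouse": {"join_cost_usage"},
}

# One valid execution order for the DAG (Airflow's scheduler does the
# equivalent resolution, plus retries, scheduling intervals, and logging).
order = list(TopologicalSorter(deps).static_order())
print(order)
```

In a real Airflow DAG file the same structure is written with operators and the `>>` dependency syntax; this sketch only shows the ordering logic itself.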

Big Data Engineer Sep 2018 – Jun 2021

Teletracking Raleigh, NC

Configured, built, and deployed new data pipelines in production using Apache Spark.

Programmed Python scripts to move data from local computer to AWS S3 Bucket.

Programmed Python scripts to retrieve data stored in AWS S3 Bucket.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets.

Designed an Apache Airflow Data Pipeline to automate the data ingestion and retrieval.

Wrote Python classes to load data from Kafka into MongoDB according to the target data model.

Wrote Scala classes to extract data from MongoDB, which over time became the source of truth.

Wrote Scala classes with functions that receive Kafka messages whenever data is added to or updated in MongoDB.

Merged data between Oracle and MongoDB using the Spark framework.

Performed unit testing using PyUnit and mock objects.

Wrote and executed MongoDB queries.

Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.

Worked with the business on prioritization of enhancement requests and bug fixes.

Worked in a team to develop an ETL pipeline that extracts Parquet-serialized files from S3 and persists them in HDFS.

Worked on various real-time and batch processing applications using Spark/Scala, Kafka and Cassandra.

Built Spark applications to perform data enrichments and transformations using Spark DataFrames with Cassandra lookups.

Automated AWS components (EC2 instances, security groups, ELB, RDS, Lambda, and IAM) through AWS CloudFormation templates.

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Used AWS EMR to process big data across Hadoop clusters of virtual servers, with storage on Amazon Simple Storage Service (S3).

Used the DataStax Spark Cassandra Connector to extract and load data to/from Cassandra.

Developed Spark application that uses Kafka Consumer and Broker libraries to connect to Apache Kafka and consume data from the topics and ingest them into Cassandra.
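Scripts that move files into S3, as described above, commonly write to date-partitioned object keys so that EMR jobs can target a single day's data. A minimal sketch of that key-building step, with a hypothetical prefix and layout (the real bucket structure is not specified here):

```python
from datetime import date

def s3_key(table: str, day: date, filename: str) -> str:
    """Build a Hive-style date-partitioned S3 object key.
    The 'ingest/' prefix and partition names are illustrative assumptions."""
    return (
        f"ingest/{table}/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}/{filename}"
    )

# In a real pipeline this key would be passed to boto3's
# s3_client.upload_file(local_path, bucket, key); here we just print it.
print(s3_key("events", date(2020, 3, 7), "part-0000.parquet"))
```

The `year=/month=/day=` layout lets Hive and Spark on EMR prune partitions instead of scanning the whole bucket.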

Hadoop Developer Jan 2017 - Sep 2018

Aetna Hartford, CT

Installed Kafka to gather data from dispersed sources and store it for consumption.

Utilized IDEs such as PyCharm for the majority of project testing and coding.

Consumed REST APIs and added source code to the Kafka program.

Created a Python file called Consumer, which called on the Producer and printed the same data the Producer displayed.

Leveraged strong skills in developing applications involving Big Data technologies such as Hadoop, Spark, MapReduce, YARN, Hive, Pig, Kafka, Oozie, Sqoop, and Hue.

Experienced in Cloudera and Hortonworks Hadoop distributions.

Worked on Apache Airflow, Apache Oozie and Azkaban.

Designed and implemented data ingestion framework to load data into the data lake for analytical purposes.

Applied knowledge on ETL development for processing the large-scale data using Big data platform.

Developed data pipelines using Hive, Pig, and MapReduce.

Wrote Map Reduce jobs.

Installed and configured Hadoop clusters from major Hadoop distributions.

Designed a reporting application that uses Spark SQL to fetch data and generate reports on Hive.

Analyzed data using Hive and wrote User Defined Functions (UDFs).
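The Producer/Consumer pairing described above follows the classic streaming pattern: a producer publishes records to a topic and a consumer reads and prints the same records. A stdlib-only sketch of that pattern, using a `queue.Queue` as a stand-in for a Kafka topic (a real deployment would use a Kafka client library instead):

```python
import queue
import threading

# Stand-in for a Kafka topic: a thread-safe FIFO buffer.
topic = queue.Queue()

def producer(records):
    """Publish each record to the topic, then a None sentinel for end-of-stream."""
    for r in records:
        topic.put(r)
    topic.put(None)

def consumer(out):
    """Read records until the sentinel, collecting them in order."""
    while True:
        r = topic.get()
        if r is None:
            break
        out.append(r)

records = [{"id": 1}, {"id": 2}]
consumed = []
t = threading.Thread(target=producer, args=(records,))
t.start()
consumer(consumed)
t.join()
print(consumed)  # the consumer sees exactly what the producer published
```

Kafka adds durability, partitioning, and consumer groups on top of this basic hand-off, but the produce/consume contract is the same.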

Big Data Engineer Feb 2015 - Dec 2016

Starbucks Seattle, WA

Imported RDBMS data from CSV files into the Sqoop pipeline project while keeping the same format.

Created script to convert CSV files into Avro while delivering them to HDFS.

Generated an Avro schema for Hive external files/tables.

Transferred the Avro files into the Hive external Raw database using HQL.

Converted Avro files into ORC files and placed them into the internal database.

Created a new database in MySQL to serve as the destination for the final phase of the project.

Used Spark to transfer the tables from internal Hive storage to MySQL.

Installed and started Airflow to run the scripts for everything that had been completed.

Created a DAG folder and a Python file that ran all the commands performed throughout the project.

Connected to Airflow at “localhost:8080” and navigated the UI to run the DAG. (The script ran faster in Airflow than in a local terminal, and the logs were displayed in a graph.)

Designed frameworks at high level and detailed level for sales and customer reports.

Successfully migrated multiple on-premises Hadoop and Spark jobs into AWS.

Designed and developed multiple hourly, daily, and weekly aggregates required for the sales and customer dashboard in Redshift.

Developed DAGs in Apache Airflow for multiple Spark jobs.

Involved in requirement gathering, sprint planning, and user-story-building sessions using Agile methodologies.

Hands-on with Git and Jenkins as part of CI/CD pipelines.
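Generating an Avro schema from a CSV source, as in the Hive external-table step above, usually starts by deriving a record schema from the CSV header. A minimal stdlib sketch of that step (the sample columns are hypothetical; a real pipeline would refine the field types before serializing with an Avro library):

```python
import csv
import io
import json

def csv_to_avro_schema(csv_text: str, name: str) -> dict:
    """Derive a minimal Avro record schema from a CSV header,
    treating every field as a string for simplicity."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

# Hypothetical sales extract; real column names would come from the source system.
sample = "sku,store,units_sold\n100,seattle_01,4\n"
print(json.dumps(csv_to_avro_schema(sample, "Sales"), indent=2))
```

The resulting schema JSON is what a tool like avro-tools or fastavro would then use to write the `.avro` files delivered to HDFS.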

Data Engineer Jan 2013 - Jan 2015

Bed Bath and Beyond Northern, NJ

Planned and built a daily process for incremental import of raw data from DB2 into Hive tables using Sqoop.

Debugged MapReduce jobs using MRUnit.

Extensively used Hive/HQL to query data in Hive tables and loaded data into Hive.

Created a data pipeline using Flume, Sqoop, Spark, and MapReduce to ingest data into HDFS for analysis.

Used Oozie and Zookeeper for workflow scheduling and monitoring.

Used Sqoop to move data from databases (SQL, Oracle) to HDFS and Hive.

Integrated Apache Storm with Kafka to perform web analytics.

Transferred clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.

Designed Hive external tables using a shared metastore with dynamic partitioning and buckets.

Worked on migrating programs into Spark transformations using Spark and Scala.

Designed and implemented an ETL process using Talend to load data.

Worked with Sqoop to import and export data between HDFS and relational database systems/mainframes and vice versa.

Loaded data into HDFS.

Enabled concurrent access to Hive tables with shared/exclusive locks.

Implemented solutions in Scala and SQL for faster testing and processing of data.

Implemented continuous streaming of data with Kafka.

Used Oozie for batch processing and dynamically scheduling workflows.

Populated HDFS and Cassandra with large volumes of data using Apache Kafka.

Used Sqoop to transfer all the tables and their data into HDFS, into specific directories created for each one.

Added an extra table into HDFS. (Originally it was an ordinary CSV file converted into a table with data. After Sqoop, the tables were sent to Hive, a data warehouse, into two databases created there: one called “raw” for external tables and another called “dsl” for internal tables. The last step was to move the internal tables into the distributed big data store HBase; executed commands to create tables in HBase and query them back through Hive.)
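The per-table Sqoop imports above are typically driven by a small script that assembles one import command per table. A sketch of that command construction, using standard Sqoop flags (`--connect`, `--table`, `--target-dir`, `--num-mappers`); the JDBC URL and directory layout are hypothetical:

```python
def sqoop_import_cmd(jdbc_url: str, table: str, target_dir: str, mappers: int = 4) -> str:
    """Assemble a Sqoop import command that pulls one RDBMS table into HDFS.
    The connection string and target directory here are illustrative only."""
    return (
        f"sqoop import --connect {jdbc_url} "
        f"--table {table} --target-dir {target_dir} "
        f"--num-mappers {mappers} --as-textfile"
    )

# Hypothetical source database; one command per table keeps each table
# in its own HDFS directory, matching the layout described above.
for table in ["orders", "customers"]:
    print(sqoop_import_cmd("jdbc:mysql://db-host:3306/retail", table, f"/user/hive/raw/{table}"))
```

The generated strings would be run via a shell or a scheduler; generating them in Python keeps the table list and directory convention in one place.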


Bachelor of Science: Computer Science

Tennessee State University Nashville, TN


R Programming: Data Science and Machine Learning

Data Mining with Python

Real-Life Data Science Exercises
