
Senior Big Data Engineer

Sterling, VA
September 08, 2021



Profile Overview

*+ years in the Big Data space, with roles including Big Data Developer, AWS Cloud Data Engineer, Hadoop Developer, and Senior Big Data Developer.

Experienced in Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, EBS, and IAM entities, roles, and users.

Imported real-time logs to Hadoop Distributed File System (HDFS) using Flume.

Experience working in Hadoop-as-a-Service (HaaS) environments and with SQL and NoSQL databases.

Built Hadoop Big Data infrastructure for both batch and real-time data processing.

Designed and built scalable Hadoop distributed data solutions using native Apache, Cloudera, and Hortonworks distributions with Spark and Hive.

Experienced in PySpark and in Hadoop streaming applications with Spark Streaming and Kafka.

Handled large datasets using partitions, Spark in-memory capabilities, broadcasts, and join transformations in the ingestion process.
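
The broadcast-join pattern mentioned above can be sketched in plain Python: the small side of the join is replicated as an in-memory lookup (as Spark does with broadcast variables), so the large side is enriched without a shuffle. The record shapes and field names here are illustrative, not from any real pipeline.

```python
# Broadcast-join sketch: the small table is "broadcast" as a local dict,
# and each record of the large dataset is enriched by a dict lookup
# instead of a distributed shuffle join.
def broadcast_join(large_rows, small_table, key):
    lookup = {row[key]: row for row in small_table}  # the broadcast side
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}

orders = [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 15}]
customers = [{"cust_id": 1, "region": "US"}, {"cust_id": 2, "region": "EU"}]
joined = list(broadcast_join(orders, customers, "cust_id"))
```

In Spark the same idea appears as `broadcast()` hints or broadcast variables; the win is that only the small table moves across the cluster.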

Performance-tuned Spark jobs in Hadoop by setting batch interval time and level of parallelism, tuning memory, adjusting configuration properties, and using broadcast variables.
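
A hypothetical `spark-submit` invocation illustrating the tuning knobs named above; the jar name and every value are placeholders, not a real job. (The streaming batch interval itself is set in code, e.g. when constructing the `StreamingContext`.)

```shell
# Illustrative spark-submit with the tuning levers described above:
# executor count (parallelism), executor memory, shuffle partitions,
# and serializer choice are all placeholder values.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-memory 6g \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  ingest_job.jar
```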

Administered the Hadoop cluster (CDM) and reviewed log files of all daemons.

Skilled in phases of data processing (collecting, aggregating, moving from various sources) using Apache Flume and Kafka.

Able to drive architectural improvement and standardization of the environments.

Expertise in using Spark to add reliable real-time data processing capabilities to enterprise Hadoop.

Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).
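
Registering and using a custom Hive UDF typically looks like the hypothetical HiveQL below; the jar path, function name, and class name are placeholders.

```sql
-- Illustrative UDF registration: add the jar containing the UDF class,
-- bind it to a function name for the session, then call it like any
-- built-in function.  All names here are placeholders.
ADD JAR /tmp/udfs/mask_udf.jar;
CREATE TEMPORARY FUNCTION mask_ssn AS 'com.example.hive.udf.MaskSSN';

SELECT mask_ssn(ssn), name
FROM customers;
```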

Good knowledge of the Spark framework for both batch and real-time data processing.

Hands-on experience processing data using Spark Streaming API with Scala.

Documented big data systems, procedures, governance, and policies.

Technology Skills

The architecture of Big Data Systems:

Amazon AWS - EC2, SQS, S3, Kinesis, Azure, Google Cloud, Horton Labs, Cloudera Hadoop, Hortonworks Hadoop, MapR, Spark, Spark Streaming, Hive, Kafka, Nifi

Programming Languages

Scala, Python, Java, Bash

Hadoop Components

Hive, Pig, Zookeeper, Sqoop, Oozie, Yarn, Maven, Flume, HDFS, Apache Airflow

Hadoop Administration

Zookeeper, Oozie, Cloudera Manager, Ambari, Yarn

Data Management

Apache Cassandra, AWS Redshift, Amazon RDS, Apache Hbase, SQL, NoSQL, Elasticsearch, HDFS, Data Lake, Data Warehouse, Database, Teradata, SQL Server

The architecture of ETL Data Pipelines

Apache Airflow, Hive, Pig, Sqoop, Flume, Scala, Python, Apache Kafka, Logstash

Scripting Languages

HiveQL, SQL, Pig Scripting, Shell Script Language

Big Data Frameworks

Spark and Kafka

Spark Framework

Spark API, Spark Streaming, Spark SQL, Spark Structured Streaming

Software Development

IDE: Jupyter Notebooks, PyCharm, IntelliJ
Continuous Integration (CI/CD): Jenkins
Versioning: Git, GitHub
Project Methods: Agile Scrum, Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing

Work History

Senior Big Data Engineer

Discovery, Inc / Sterling, VA / January 2020 to Current

Prepared scripts to automate ingestion of data in Python as needed through various sources such as API and AWS S3 vendor buckets.
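
A minimal sketch of the staging step such ingestion scripts share, assuming records have already been fetched (in the real pipeline they would come from a vendor API or an S3 bucket, e.g. via boto3); the function name and file layout are hypothetical.

```python
import json
import os
import tempfile

def stage_records(records, out_dir, name):
    """Stage fetched records as newline-delimited JSON for downstream loads.

    Hypothetical helper: records are passed in directly so the staging
    step itself is testable without network or AWS access.
    """
    path = os.path.join(out_dir, f"{name}.jsonl")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")
    return path

# Example usage with an inline sample record
staged = stage_records([{"id": 1, "v": "a"}], tempfile.mkdtemp(), "vendor_feed")
```

Newline-delimited JSON keeps each record independently parseable, which suits line-oriented loaders such as Hive external tables or Redshift `COPY`.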

Prepared SQL scripts to load ingested data in Apache Hive and Amazon Redshift using AWS Glue as an external metastore.

Wrote, compiled, and executed programs as necessary for Apache Spark in Scala to perform ETL jobs with ingested data.

Automated resulting scripts using AWS Data Pipeline and shell scripting to ensure daily execution in production.

Maintained version control and organized repositories in GitHub.

Created a general outline for future data pipelines to follow for ease of use and automation.

Wrote documentation for legacy support of finished projects.

Performed QA testing on data pipeline repositories using Jenkins as a CI/CD service.

Monitored past projects for outages or other issues using Amazon SNS notifications.

Worked with a rapidly growing team that had members in remote, international locations.

Met with project lead and other team members to ascertain the best method of solving challenges.

Met with required parties to prepare and plan execution for new data ingestion.

Met with clients whenever a past project would need additional requirements.

Communicated with the team daily to provide updates on current projects.

Assisted other team members through knowledge transfers and insight on their respective projects.

Collaborated with members responsible for ETL to satisfy customer requirements.

Performed day-to-day tasks regarding projects in a “semi-agile” environment.

Utilized Jira to handle tickets for troubleshooting as necessary.

Troubleshot finished pipelines as needed when SNS messages signaled failure.

Managed and launched Amazon EMR instances for development use when end-to-end testing was not feasible.

Utilized a fail-fast method for non-critical issues in development to quickly try proposed solutions.

Worked in an Agile Scrum environment, participating in Daily Scrums, Sprints, Sprint planning, etc.

Hadoop Developer

Equifax / Atlanta, GA/ November 2017 to December 2019

Installed and configured a Flume agent to ingest data from a REST API.

Used Sqoop jobs with incremental load to migrate data from RDBMSs to Hive.
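
An incremental Sqoop job of this kind typically looks like the hypothetical command below; the connection string, credentials, table, and check column are all placeholders. Defining it as a saved job lets Sqoop persist the last imported value, so each run picks up only new rows.

```shell
# Illustrative saved Sqoop job for an incremental append import into Hive.
# All connection details and names are placeholders.
sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --hive-import --hive-table sales.orders \
  --incremental append \
  --check-column order_id \
  --last-value 0
```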

Involved in transforming data from legacy tables to HDFS and HBase tables using Spark.

Developed Oozie workflows to run multiple Hive and Spark jobs.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.

Analyzed the SQL scripts and designed the solution to implement using PySpark.

Developed and implemented various methods to load data into HIVE tables from HDFS and Local File System.

Migrated complex MapReduce scripts to Apache Spark code.

Used the Spark SQL module to store data in Hive.

Loaded data into Spark RDDs and performed in-memory computation to generate output responses.

Implemented Spark using Scala and SparkSQL for faster testing and processing of data.

Analyzed data using Hive and wrote User Defined Functions (UDFs) under the direction of the data science team.

Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data using Spark.

Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables, handling structured data with Spark SQL.

Prepared Python scripts to automate ingestion of data as needed from various sources such as APIs, saving it to HDFS.

Developed Spark programs using PySpark APIs to compare the performance of Spark with Hive and SQL.

Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.

Used Scala libraries to process XML data stored in HDFS and wrote the processed data back to HDFS.

Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language.

Installed Flume and configured the source, channel, and sink.

AWS Cloud Data Engineer

Wells Fargo / San Francisco, CA/ March 2016 to November 2017

Imported data from different sources such as AWS S3 into Spark RDDs and performed transformations and actions on them.

Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
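
Hypothetical DDL illustrating an external, partitioned, bucketed Hive table of this kind; the table, columns, and storage location are placeholders.

```sql
-- Illustrative external Hive table with partitioning and bucketing.
-- All names and the location are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS txns (
  txn_id     BIGINT,
  account_id STRING,
  amount     DECIMAL(12, 2)
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (account_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/data/warehouse/txns';

-- Dynamic partitioning on insert: the partition value comes from the data.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE txns PARTITION (txn_date)
SELECT txn_id, account_id, amount, txn_date FROM txns_staging;
```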

Involved in HBase setup and in storing data into HBase for use in analysis.

Used Impala for querying HDFS data to achieve better performance.

Implemented HQL scripts to load data into and retrieve data from Hive.

Used JSON and XML serialization and deserialization to load JSON and XML data into Hive tables.

Tested on MongoDB NoSQL data modeling, tuning, disaster recovery and backup.

Used Avro, Parquet, and ORC data formats to store data in HDFS.

Deployed to Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.

Worked with various HDFS file formats such as Avro and SequenceFile, and compression formats such as Snappy.


MetLife Insurance / New York, NY / November 2014 to March 2016

Translated a logical database design or data model into an actual physical database implementation.

Designed and configured MySQL server cluster and managed each node on the Cluster.

Performed security audits of MySQL internal tables and user access, and revoked access for unauthorized users.

Developed database architectural strategies at the modeling, design, and implementation stages.

Analyzed and profiled data for quality and reconciled data issues using SQL.

Applied performance tuning to a large, high-volume, multi-server MySQL installation backing clients' job-applicant sites.

Defined procedures to simplify database upgrades.

Standardized MySQL installs on all servers with custom configurations.

Modified database schemas.

Set up replication for disaster and point-in-time recovery.
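
Setting up a MySQL replica for this purpose typically involves statements like the hypothetical ones below; the host, credentials, and binlog coordinates are placeholders. Binary logging on the source is also what enables point-in-time recovery, by replaying `mysqlbinlog` output up to a chosen moment.

```sql
-- Illustrative replica configuration (MySQL 5.x-era syntax).
-- Host, user, password, and binlog coordinates are placeholders.
CHANGE MASTER TO
  MASTER_HOST = 'db-primary.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS = 4;
START SLAVE;
```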

Maintained existing database systems.


Education

Bachelor of Science - Computer Science - Claflin University
