Senior Data Engineer
Phone: (703) 936-4037
Eight years’ experience in Big Data projects applying Apache Spark, Hive, Apache Kafka, and Hadoop solutions while supporting project teams in roles such as Big Data Engineer, Big Data Developer, Big Data Analyst, and Hadoop Engineer.
Demonstrated technical skill applying solutions using leading Big Data technologies such as Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, Hadoop, and Amazon Web Services (AWS).
Proven track record working on Big Data solutions development teams that operate within an Agile design/development/deployment methodology to build scalable, cost-effective solutions for handling massive amounts of data.
Vast knowledge and experience with data loading and processing, on-premises and cloud platforms, distributed systems, Hadoop Distributed File System (HDFS) architectures, and multiple data processing frameworks.
Five roles held over eight years: Hadoop Big Data Engineer, Hadoop Data Engineer, Big Data Developer, Hadoop Enterprise Architect, and Data Engineer.
• Apache: Flume, Hadoop, YARN, Hive, Kafka, Maven, Oozie, Spark, Tez, ZooKeeper, Impala.
• Compute Engines: Apache Spark, Flink, MapReduce.
• Databases/Languages: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB, Oracle, MySQL, Spark SQL.
• Operating Systems: Unix/Linux, Windows.
• Open-Source Frameworks: Apache Hadoop, Apache Spark, Apache Storm, Apache Cassandra, Hortonworks Data Platform (HDP).
• Programming Languages: Python, SQL, HiveQL.
• Data Import/Export & Storage Formats: Apache Sqoop (SQL-to-Hadoop), Apache Flume, CSV, JSON, plain text, SSMS, T-SQL.
• Distributions: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).
• Data Visualization Tools: Pentaho, QlikView, Tableau, Power BI, Matplotlib.
• Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
DATA ENGINEER February 2020 – Present
Freddie Mac, McLean, VA
• Define the Spark/Python (PySpark) ETL framework and best practices for development.
• Implement Spark on AWS EMR using PySpark, leveraging the DataFrame and Spark SQL APIs for faster data processing.
• Register datasets in AWS Glue through its REST API.
• Use AWS API Gateway to trigger Lambda functions.
• Query data residing in AWS S3 buckets with Athena.
• Run data pipelines with AWS Step Functions.
• Use DynamoDB to store metadata and logs.
• Write Python classes to load data from Kafka into DynamoDB according to the target data model.
• Monitor and manage services with AWS CloudWatch.
• Perform transformations using Apache SparkSQL.
• Write Spark applications for data validation, cleansing, transformation, and custom aggregation.
• Develop Spark code using Python and Spark SQL for faster testing and data processing.
• Tune Spark configurations to improve job performance.
• Configure ODBC and Presto drivers with Okera and RapidSQL.
HADOOP ENTERPRISE ARCHITECT May 2018 – February 2020
American Express, Phoenix, AZ
• Worked embedded with the core team on-site and collaborated remotely with offshore teams.
• Oversaw Big Data engineering tasks for a team of eleven on-site engineers and several offshore teams.
• Developed a data pipeline with Kafka and Spark.
• Developed Spark applications in Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
• Created a Kafka broker that applies a schema to fetch structured data for Spark Structured Streaming.
• Fixed a Hive-to-Hive connection using Python and Spark to optimize performance.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python.
• Created a POC for migrating existing MapR Hive-on-MapReduce systems to Apache Spark 2.4.7, yielding roughly 10x faster query responses.
• Trained engineering staff on engineering best practices and technologies, including TDD, BDD, automated testing, change impact analysis, Hadoop/HDFS, Apache Hive.
• Updated onboarding process and documentation for new hire engineers, shortening the onboarding process from two weeks to two days.
• Led effort in refreshing outdated automated test suite, bringing failing test count from over 150 down to zero.
• Streamlined Unix shell and Hive program scripts, decreasing program startup time by a factor of 10.
• Introduced a system for creating and reviewing pull requests in Jira for production feature enhancements, reducing defects and regressions by a factor of three.
• Served as liaison between Big Data team, Java Development Team and Web Development teams, resolving conflicts and creating and assigning user stories in Jira.
• Set up automated continuous integration processes using a private Jenkins server.
• Standardized team project management, source control, and documentation using the Atlassian Enterprise Jira/Confluence/Bitbucket stack.
• Created UDFs and stored procedures to add customizable, modular, and reusable functionality to existing Hive, Pig, and SQL programs.
BIG DATA DEVELOPER April 2016 – May 2018
AnswerLab, San Francisco, CA
• Installed the Oozie workflow engine to run multiple Spark jobs.
• Wrote Spark SQL queries and optimized the Spark queries with Spark SQL.
• Imported real-time logs into HDFS using Flume.
• Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers.
• Installed the Hortonworks (HDP) cluster and monitored its performance.
• Made incremental imports to Hive with Sqoop.
• Managed Hadoop clusters and checked cluster status using Ambari.
• Moved relational database data into Hive dynamic-partition tables via staging tables using Sqoop.
• Initiated data migration from/to traditional RDBMS with Apache Sqoop.
• Developed Python scripts to automate the workflow processes and generate reports.
HADOOP DATA ENGINEER January 2015 – April 2016
Gulfstream, Savannah, GA
• Created new instances with Hadoop installed and running from saved instance image files.
• Developed a task execution framework on EC2 instances using SQL and DynamoDB.
• Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.
• Connected various data centers and transferred data between them using Sqoop and various ETL tools.
• Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop.
• Used the Hive JDBC to verify the data stored in the Hadoop cluster.
• Integrated Kafka with Spark Streaming for real-time data processing.
• Imported data from disparate sources into Spark RDD for processing.
• Built a prototype for real-time analysis using Spark streaming and Kafka.
• Transferred data from AWS S3 using Informatica.
• Used Amazon Redshift to store data in the cloud.
• Collected business requirements from subject-matter experts such as data scientists and business partners.
• Designed and developed technical specifications using Hadoop technologies.
• Loaded and transformed large sets of structured, semi-structured, and unstructured data.
• Used different file formats like Text files, Sequence Files, Avro, Parquet, and ORC.
• Loaded data from various data sources into HDFS using Kafka.
• Tuned and operated Spark and related technologies such as Spark SQL and Spark Streaming.
• Used shell scripts to dump the data from MySQL to HDFS.
• Implemented and integrated NoSQL databases such as MongoDB.
• Exported analyzed data to Hive tables using Sqoop, making it available to the BI team for visualization and report generation.
• Configured the Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs.
• Consumed data from Kafka queues using Storm.
HADOOP BIG DATA ENGINEER September 2013 – December 2014
Intuit, Inc., Menlo Park, CA
• Tuned the performance of Pig queries.
• Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
• Worked with Flume to load log data from multiple sources directly into HDFS.
• Tuned performance and troubleshot MapReduce jobs by reviewing and analyzing log files.
• Managed data coming from different sources.
• Loaded data from the UNIX file system into HDFS.
• Loaded and transformed large sets of structured, semi-structured, and unstructured data.
• Provided cluster coordination services through ZooKeeper.
• Managed and reviewed Hadoop log files.
• Managed jobs using the Fair Scheduler and scheduled the Oozie workflow engine to run multiple Hive, Sqoop, and Pig jobs.
• Performed cluster maintenance: added and removed cluster nodes, monitored and troubleshot clusters, and managed and reviewed data backups and Hadoop log files.
• Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
• Analyzed large data sets to determine the optimal way to aggregate and report on them.
• Developed Sqoop jobs to populate Hive external tables using incremental loads.
• Supported setup of the QA environment and updated configurations for implementing Pig and Sqoop scripts.
• Crawled websites using Python to collect information about users, questions asked, and answers posted.
• Developed web applications using Python on Linux and UNIX platforms.
• Performed automation testing within the Waterfall Software Development Life Cycle (SDLC), with a good understanding of Agile methodology.
Wayne State University – PhD, Physics