6+ years of big data engineering and information technology experience in Hadoop environments, both on-prem and cloud. Expertise in Hadoop components and ecosystem and in migrating legacy technologies to the AWS cloud.
6 years - Big Data Engineering / Information Technology
Summary of Competencies
6+ years of experience in development of custom Hadoop Big Data solutions, platforms, pipelines, data migration, and data visualizations.
Creation of ETL processes to transform data to one consistent format for data cleansing and analysis.
Performed streaming data ingestion into the Spark distributed environment using Kafka.
Integrated Kafka with Spark streaming for high-speed data processing.
Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
Used Spark SQL and Structured Streaming for real-time processing of structured data.
Created AWS Lambda functions to feed data from S3 into Spark Structured Streaming, applying a schema to produce structured output (sketch at the end of this summary).
Configured Spark jobs to write processed data to Redshift and EMR HDFS (Hadoop).
Documented the requirements, including the available code implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch.
Implemented Kafka messaging consumers.
Used broadcast variables in Spark for effective and efficient joins and transformations.
Implemented Spark and Spark SQL for faster testing and processing.
Performance tuning of Spark jobs for setting batch interval time, level of parallelism, and memory tuning.
Developed a POC using Scala, deployed it on a YARN cluster, and compared the performance of Spark with Hive and SQL.
Installed, configured, and monitored Kafka clusters; architected a lightweight Kafka broker; integrated Kafka with Spark for real-time data processing.
Built a Spark proof of concept in Python using PySpark.
Developed Spark applications using Spark Core, Spark SQL, and the Spark Streaming API.
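A minimal PySpark sketch of the Kafka-to-Structured-Streaming pattern referenced above; the broker address, topic name, schema fields, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster.

# Minimal sketch (hypothetical broker, topic, schema, and paths): read events from Kafka,
# apply a schema with Spark SQL types, and write the structured result to Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "events")                        # hypothetical topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")              # hypothetical output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())
query.awaitTermination()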
Technical Skills Profile
Big Data
RDDs, UDFs, Data Frames, Datasets, Pipelines, Data Lakes, Data Warehouse, Data Analysis
Hadoop
Hadoop, Cloudera (CDH), Hortonworks Data Platform (HDP)
Spark
Apache Spark, Spark Streaming, Spark API, Spark SQL
Hadoop Components
HDFS, Hive, Pig, ZooKeeper, Sqoop, Oozie, YARN
Apache
Apache Kafka, Apache Maven, Apache Oozie, Apache Pig, Apache Sqoop, Apache Flume, Apache Hadoop, Apache HBase, Apache Cassandra, Apache Lucene, Apache Solr, Apache Airflow, Apache Camel, Apache Mesos, Apache Tez, Apache ZooKeeper
Programming
Java, PySpark, Python, Spark, Scala
Development
Agile, Scrum, Continuous Integration, Test-Driven Development (TDD), Unit Testing, Functional Testing, Git, GitHub, Jenkins (CI/CD)
Query Language
SQL, Spark SQL, HiveQL
Database
Apache Cassandra, AWS Redshift, Amazon RDS, Apache HBase, SQL, NoSQL, Elasticsearch
File Management
HDFS, Parquet, Avro, Snappy, Gzip, ORC
Cloud Platforms
Amazon Web Services (AWS)
Security and Authentication
AWS IAM, Kerberos
AWS Services
AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM
Virtualization
VMware, VirtualBox, OSI, Docker
Data Visualization
Tableau, Kibana, Crystal Reports 2016, IBM Watson
Cluster Security
Ranger, Kerberos
Query Processing
Spark SQL, HiveQL
Professional Experience Profile
BIG DATA CLOUD ENGINEER
AAA Life Insurance, Livonia, MI May 2020-Present
Architected and built a new AWS Organizations cloud environment with PCI, PHI, and PII compliance
Created a new data lake, ingesting data from on-prem and other clouds into S3, Redshift, and RDS
Used Terraform Enterprise and GitLab to deploy IaC to various AWS accounts
Integrated big data Spark jobs with EMR and Glue to create ETL jobs for around 450 GB of data daily
Optimized EMR clusters with partitioning and the Parquet format to increase ETL speed and efficiency (illustrative sketch at the end of this section)
Created a new Redshift cluster for data science, using QuickSight for reporting and mobile visualization
Created training material and assisted others with using AWS boto3 and Lambda
Developed a new API Gateway for streaming to Kinesis and ingesting event streaming data
Implemented AWS Step Functions for orchestration and CloudWatch Events for pipeline automation
Used the Glue Data Catalog (Hive metastore compatible) to obtain and validate data schemas, and Lake Formation for data governance
Maintained a GitLab repository, using Git, Bash, and Ubuntu to work on project code
Created metadata tables for Redshift Spectrum and Amazon Athena for serverless ad hoc querying
Tuned EMR clusters for big data with different data formats and compression types such as Parquet and Gzip
Assisted the data admin and cloud admin with initial setup of IAM and Postgres user permissions
Debugged and resolved various errors and issues in the AWS environment using CloudWatch Logs
Created the S3 bucket structure and data lake layout for optimal use of Glue crawlers
Used Terraform to create various Lambda functions for orchestration and automation
Set up and configured the AWS CLI and boto3 for interacting with the cloud environment
Extracted data from Redshift clusters with a SQL CLI tool to identify issues with long-running data science queries
Used Spark and PySpark for streaming and batch applications in many ETL jobs to and from data sources
Built queries in Redshift to implement SCDs and maintain historical information
Built Machine Learning Pipelines for SageMaker for Data Science and Machine Learning Engineers
Trained data for machine learning prediction models and deployed endpoints to consumers
Developed PySpark code optimized for RDDs, DataFrames, and internal data structures
Coded Step Functions for Kinesis and Kinesis Firehose, using DynamoDB as a metadata store
Used AWS KMS for server-side encryption and environment variables for sensitive data for compliance
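A minimal PySpark sketch of the EMR ETL pattern referenced above; the bucket names, paths, and column names are hypothetical.

# Minimal EMR ETL sketch (hypothetical buckets, paths, and columns):
# raw JSON from the landing zone -> cleaned, partitioned Parquet in the curated zone.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

raw = spark.read.json("s3://example-raw-zone/policies/")       # hypothetical landing path

cleaned = (raw
           .dropDuplicates(["policy_id"])
           .withColumn("effective_date", to_date(col("effective_date")))
           .filter(col("policy_id").isNotNull()))

# Partitioned Parquet keeps Athena / Redshift Spectrum scans small and cheap.
(cleaned.write
        .mode("overwrite")
        .partitionBy("effective_date")
        .parquet("s3://example-curated-zone/policies/"))        # hypothetical curated path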
SENIOR BIG DATA ENGINEER
Robinhood – Menlo Park, CA July 2018-May 2020
Configured Spark streaming to receive real time data from Kafka and store to HDFS
Worked on disaster management with Hadoop cluster (standby vs secondary)
Used Spark Streaming with Kafka and MongoDB to build a continuous ETL pipeline for real-time analytics
Imported and exported datasets between traditional databases and HDFS using Sqoop
Managed ETL jobs with UDFs in Pig scripts and Spark to perform transformations, joins, and aggregations before loading into HDFS
Performed performance tuning for Spark Streaming by setting the right batch interval time and level of parallelism, choosing an appropriate serializer, and tuning memory (tuning sketch at the end of this section)
Ingested data using Flume with a Kafka source and an HDFS sink
Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters
Optimized data storage in Hive using partitioning and bucketing on managed and external tables
Collected, aggregated and moved data from servers to HDFS using Apache Spark & Spark Streaming
Used Spark API over Hadoop YARN to perform analytics on data in Hive
Worked in a multi-cluster environment, setting up the Cloudera and Hortonworks Hadoop ecosystems.
Developed Scala scripts on Spark to perform data inspection, cleaning, and loading, and to transform large sets of JSON data to Parquet format.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Prepared Spark builds from MapReduce source code for better performance.
Used Spark API over Hortonworks, Hadoop YARN to perform analytics on data in Hive.
Explored Spark for improving performance and optimizing existing Hadoop MapReduce algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Integrated Kafka with Spark Streaming for real time data processing
Moved transformed data to the Spark cluster, where it is set to go live on the application, using Kafka.
Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
Handled schema changes in the data stream using Kafka.
Analyzed and tuned data model Cassandra tables during DB2 to Cassandra migration process.
Initialized Cassandra data modeling and updated and maintained Chef cookbooks.
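A minimal sketch of the Spark Streaming tuning described above; all values (parallelism, memory, batch interval) are hypothetical and would be sized to the cluster and incoming data rate.

# Minimal tuning sketch (hypothetical values): serializer, parallelism, executor memory,
# backpressure, and batch interval for a Spark Streaming job.
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("tuned-streaming-job")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  # faster serialization
        .set("spark.default.parallelism", "200")               # roughly match total executor cores
        .set("spark.executor.memory", "4g")
        .set("spark.streaming.backpressure.enabled", "true"))  # throttle ingestion under load

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Batch interval chosen so each micro-batch finishes before the next one starts.
ssc = StreamingContext(spark.sparkContext, batchDuration=10)
# A Kafka or Flume receiver would be attached to ssc here before calling ssc.start().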
AWS BIG DATA ENGINEER
Capital One, Mclean, VA May 2017-July 2018
Created highly scalable, resilient, and performant architecture using Amazon AWS cloud technologies such as Simple Storage Service (S3), Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), Elastic Container Service (ECS), Lambda, and Elastic Load Balancing (ELB)
Deployed containerized applications using Docker, allowing for standardized service infrastructure.
Monitored production software with logging, visualization, and incident management software such as Splunk and Kibana
Took advantage of new Spark Avro functionality through upgrading
Provided live demonstrations of software systems to nontechnical, executive level personnel, showing how the systems were meeting business goals and objectives.
Managed Spark clusters exclusively from the AWS Management Console.
Created and managed cloud VMs with the AWS EC2 command line clients and the AWS Management Console.
Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), and AWS Redshift
Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 (handler sketch at the end of this section).
Populated database tables via AWS Kinesis Firehose and AWS Redshift.
Automated the installation of the ELK agent (Filebeat) with an Ansible playbook.
Used AWS CloudFormation templates alongside Terraform with existing plugins.
Used AWS IAM to create new users and groups.
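A minimal boto3 sketch of the Lambda pattern referenced above; the bucket, key, and table names are hypothetical.

# Minimal Lambda handler sketch (hypothetical bucket and table names):
# react to an S3 put event and record object metadata in DynamoDB for downstream jobs.
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
audit_table = dynamodb.Table("ingest-audit")        # hypothetical metadata table

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)

        # Store basic metadata about each newly arrived object.
        audit_table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": head["ContentLength"],
        })
    return {"statusCode": 200, "body": json.dumps("processed")}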
BIG DATA ENGINEER
Alibaba – Remote – Hangzhou, China Mar 2016-May 2017
Built scalable distributed data solutions using Hadoop.
Installed and configured Pig for ETL jobs and wrote Pig scripts with regular expressions for data cleansing
Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows
Used Oozie Scheduler system to automate the pipeline workflow and orchestrate the jobs that extract the data in a timely manner
Moved data from Oracle to HDFS and vice versa using Sqoop
Imported data using Sqoop, loading data from MySQL and Oracle to HDFS on a regular basis
Used Linux shell scripts to automate the build process and to perform regular jobs like file transfers between different hosts
Documented Technical Specs, Dataflow, Data Models, Class Models
Successfully loaded files to HDFS from Teradata and loaded data from HDFS to Hive
Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration
Captured data and imported it to HDFS, using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
Analyzed the Hadoop cluster with different big data analytics tools, including Hive and Spark, and managed Hadoop log files.
Designed Hive queries to perform data analysis, data transfer, and table design.
Created Hive tables, loaded them with data, and wrote Hive queries to process the data (Spark SQL sketch at the end of this section).
Used the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
Loaded data from different sources such as HDFS and HBase into Spark RDDs and performed in-memory computation to generate the output response.
Handled the data exchange between HDFS and different Web Applications and databases using Flume and Sqoop.
Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers between different hosts.
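A minimal PySpark sketch of the Hive table work referenced above; the database, table, and column names (including the staging table) are hypothetical.

# Minimal Hive-on-Spark sketch (hypothetical database/table/column names):
# partitioned, Parquet-backed Hive table managed and queried through Spark SQL on YARN.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-analysis")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# Partitioning by date keeps typical date-bounded scans cheap.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Load one day's staged data into the matching partition (staging table is hypothetical).
spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders PARTITION (order_date = '2016-06-01')
    SELECT order_id, amount
    FROM staging.orders_raw
    WHERE order_date = '2016-06-01'
""")

# Typical analysis query handed back to reporting jobs.
spark.sql("""
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM analytics.orders
    GROUP BY order_date
""").show()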
HADOOP DATA DEVELOPER
Gulfstream, Savannah, GA Jan 2015-Mar 2016
Deployed the application jar files into AWS instances.
Used instance image files to create new instances with Hadoop installed and running
Developed a task execution framework on EC2 instances using SQL and DynamoDB
Designed a cost-effective archival platform for storing big data using Sqoop and various ETL tools
Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop
Used Hive with Spark Streaming for real-time processing
Imported data from different sources into Spark RDD for processing
Built a prototype for real-time analysis using Spark streaming and Kafka
Transferred data from AWS S3 using the Informatica tool
Used AWS Redshift for storing data in the cloud
Collected business requirements from subject matter experts such as data scientists and business partners
Streamed the analyzed data to Hive tables using Sqoop, making it available for visualization and report generation by the BI team
Configured the Oozie workflow scheduler to run multiple Hive, Sqoop, and Pig jobs
Used NoSQL databases like MongoDB for implementation and integration
Used the Ambari stack to manage big data clusters and performed upgrades for the Ambari stack, Elasticsearch, etc.
Installed and configured Tableau Desktop on one of the three nodes to connect to the Hortonworks Hive Framework (Database) through the Hortonworks ODBC connector for further analytics of the cluster.
Assisted in installing and configuring Hive, Pig, Sqoop, Flume, Oozie, and HBase on the Hadoop cluster with the latest patches.
Created Oozie workflows for various tasks like similarity matching and consolidation
Enabled security to the cluster using Kerberos and integrated clusters with LDAP at Enterprise level.
Implemented user access with Kerberos and cluster security using Ranger.
Education
Bachelor of Science Degree in Finance, College of Charleston, SC
Certification
AWS Certified Solutions Architect – Associate
AWS Certified Developer – Associate