Data Aws

Location:

United States

Posted:

January 27, 2021


Resume:

* ***** ** *** **** experience in different industries and environments.

Utilized the Spark DataFrame and Dataset APIs through Spark SQL for optimized processing.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

Strong understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase, Cassandra, and Elasticsearch.

Used Kafka for streaming data ingestion into the distributed Spark environment.

Performed data ingestion, entity resolution and ran ad-hoc queries using HDFS and Hive.

Created multiple reports using data residing in Hive per the request of the client.

Created External Hive tables to store the processed results in a tabular format.

Developed new Flume agents to extract log data from data sources into the Hadoop Distributed File System (HDFS).

Constructed a Kafka broker with proper configurations for the needs of the organization.

Made recommendations and significant improvements through CI/CD automation.

Designed infrastructure for ELK clusters.

Coordinated Kafka operation and monitoring with DevOps personnel.

Worked in a multi-cluster environment and set up the Hortonworks Hadoop ecosystem.

Used Jenkins with Git for CI/CD integration.

Implemented advanced feature engineering procedures for the data science team using in-memory computing frameworks such as Apache Spark, with code written in Scala.

Used Hive Query Language (HQL) to derive customer insights that business users relied on for critical decision making.

Hands-on experience with Spark Streaming to receive real-time data from Kafka.

Experience with Spark Structured Streaming to process structured streaming data.

Hands-on experience using Apache Spark framework with Scala.

Experience with multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and Redshift.

Experience with AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda).

Created Hive managed and external tables with partitioning and bucketing, and loaded data into Hive.

Developed data queries using HiveQL and optimized the Hive queries.

Coding

Spark

Spark SQL

Spark Streaming

Spark Structured Streaming

Scala

PySpark

Kafka

Shell Script Language

DevOps

CI/CD

Jenkins

Cloud

AWS RDS

AWS EMR

AWS Redshift

AWS S3

AWS Lambda

AWS Kinesis

AWS ELK

AWS CloudFormation

AWS IAM

Hadoop

HDFS

Hive

Admin

Ambari

ZooKeeper

Oozie

Workflows

ETL

Sqoop

Database

HBase, Cassandra

Elasticsearch

Stacks

Hadoop, ELK

Big Data Administrator

Hortonworks

HDP, Cluster

YARN

Cluster Security

Kerberos, Ranger

Data Visualization

Kibana

Distributions

Hortonworks (HDP)

Cloudera (CDH)

Big Data Engineer

JP Morgan Chase

Newark, DE

October 2019 – Present

Responsible for monitoring and maintaining the archival system.

Used PySpark to process data residing in HDFS.

Wrote a PySpark program that used SparkContext to parse out the needed data, select the columns containing target information, and assign column names.

Managed and deployed automation scripts for the processing scripts.

Responsible for transferring data from the cluster to a long-term storage system.

Troubleshot and optimized the archival system.

Solely responsible for automating the archival system to streamline day-to-day processes.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

Developed multiple Spark Streaming and batch Spark jobs using Python on AWS.

Defined and optimized Spark jobs using techniques such as parameter and flag tuning.

Developed the automation and the archival system in Python.

Ingested data from S3 and stored it in MariaDB.

Wrote Airflow scripts to automate the system.

Performed coverage and unit testing on the Python scripts using pytest.

Increased code performance and code coverage by 40%.

Worked with terabytes of data in HDFS.

Developed archival scripts that can be run either in a Jupyter notebook or on the command line.

Developed multiple processing jobs to filter and transform data.

Produced archival reports in Microsoft Excel format.

Held all new code contributions to a minimum of 75% code coverage.

Used PySpark modules to store the data on HDFS.

Technologies: Python, HDFS, PySpark, Pandas, Jupyter Notebook, Pytest, Coverage.py, Bitbucket, Git, IntelliJ, Vim, Essbase, CSV, MS Excel, Notepad++, Windows, MS Outlook, PuTTY, Linux, Parquet, SonarQube, Active Directory, Jenkins, Apache Airflow, Confluence, Jira, MariaDB, GOS DB, AWS S3

Big Data Enterprise Architect

Adobe

San Jose, CA

June 2019 – October 2019

This project involved taking data provided by WebMD and formatting it for ingestion into Adobe Audience Manager, which accepts a tab-separated format. Handled schema changes and schema evolution.

Flattened dataset rows to produce single rows for each User ID.

Participated in project planning for WebMD’s business needs and technical challenges of the project.

Managed communication with remote teams in subsequent meetings to align and control business processes.

Worked with and mentored two junior developers.

Used a private Git repository for the code base and version control; the team used Salesforce for tracking tasks and reporting.

Administered AWS with IAM privileges for AWS security.

Created an architecture using AWS Lambda in conjunction with an Adobe-designed task scheduler similar to Airflow.

Implemented modifications involving PySpark within a Python application framework.

Consumed clickstream data from WebMD stored in Google Cloud, consisting of network clicks and network impressions. These data sets were combined, then flattened so that there was only one row for each unique user ID.

Handled thousands of files, each usually between 1 and 5 GB compressed.

Managed network clicks data of about 10 million rows and network impressions data of about 8 billion rows in total.

Brought data into a PySpark application running on an EMR cluster for processing, with output written to AWS S3.

Optimized data processing with the DataFrame API, migrating from the old approach of grouping with joins to the newer Dataset API with the Catalyst optimizer.

Improved data pipeline performance by transitioning away from the existing DataFrame API usage and by defining the schema as a Python list, then applying it to the data.

Worked with the resultant data structure as a DataFrame with a single column of tab-separated values, which was saved to a tab-separated CSV file.

Resolved the problem of duplicate columns, and thus the inefficiency of the job as a whole.

Achieved the goal of processing the entire dataset (1.34 GB of files) in under 5 minutes.

Optimized code to run locally in 1 minute 51 seconds, with a total execution time of 3 minutes 45 seconds when running on EMR, including ingesting from and outputting to the cloud.

Technologies: Python, PySpark, AWS S3, AWS Lambda, AWS EMR, Google Cloud Storage
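As a rough illustration of the flattening step described for this project, the sketch below combines click and impression records and emits one tab-separated row per user ID. It uses plain Python rather than PySpark so that it runs without a cluster. The field names (`user_id`, `segment`) and the helpers `flatten_by_user` and `to_tsv` are hypothetical and are not taken from the project itself.

```python
import csv
import io
from collections import defaultdict


def flatten_by_user(clicks, impressions):
    """Combine two record lists and return one row per user ID.

    Each record is a dict with the hypothetical keys 'user_id' and 'segment'.
    """
    segments = defaultdict(set)
    for record in list(clicks) + list(impressions):
        segments[record["user_id"]].add(record["segment"])
    # One (user_id, comma-joined sorted segments) row per unique user.
    return [(uid, ",".join(sorted(segs))) for uid, segs in sorted(segments.items())]


def to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: user u1 appears in both inputs.
clicks = [{"user_id": "u1", "segment": "a"}, {"user_id": "u2", "segment": "b"}]
impressions = [{"user_id": "u1", "segment": "c"}, {"user_id": "u1", "segment": "a"}]

flat_rows = flatten_by_user(clicks, impressions)
tsv_text = to_tsv(flat_rows)
print(tsv_text, end="")
```

In a PySpark job the same idea would typically be expressed as a union of the two inputs followed by a group-by on the user ID; the in-memory version here only shows the shape of the transformation.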

Big Data Engineer

Home Depot

Atlanta, GA

Mar 2018 – June 2019

Worked with analysts to model Cassandra tables from business rules and enhance/optimize the existing tables.

Handled versioning with Git and set up Jenkins CI to manage CI/CD practices.

Created infrastructure design for ELK clusters.

Handled millions of messages per day funneled through Kafka topics.

Implemented advanced feature engineering procedures for the data science team using in-memory computing frameworks such as Apache Spark, with code written in Scala.

Used Hive Query Language (HQL) to derive customer insights that business users relied on for critical decision making.

Built real-time streaming data pipelines with Kafka, Spark Streaming, and HBase.

Used Spark Structured Streaming with the Spark SQL engine to process real-time structured data.

Built a proof-of-concept model of the data processing using PySpark programs.

Primarily used the Apache Spark framework with Scala.

Optimized Hive queries using partitioning and bucketing techniques.

Created Hive generic UDFs to process business logic.

Moved relational database data into Hive dynamic-partition tables with Sqoop, using staging tables.

Integrated Kafka with Spark Streaming for real-time data processing.

Used Kafka to move transformed data to the Spark cluster, where the data was set to go live on the application.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

Handled schema changes in the data stream using Kafka.

Big Data Engineer

PFIZER

New York, NY

Jan 2017 – Mar 2018

Created the basic infrastructure of the pipeline using AWS CloudFormation.

Installed, configured, and managed AWS tools such as ELK and CloudWatch for resource monitoring.

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Implemented AWS IAM user roles and policies to authenticate and control access.

Used AWS Kinesis for real-time data processing.

Worked with AWS Lambda functions for event-driven processing across various AWS resources.

Managed AWS Redshift clusters, including launching clusters with specified nodes and running data analysis queries.

Managed and monitored AWS EC2 instances through the AWS Management Console.

Worked with RDS, CloudFormation, AWS IAM, and security groups in public and private subnets in a VPC.

Worked on multiple AWS instances, configuring security groups, Elastic Load Balancer, AMIs, and Auto Scaling to design cost-effective, fault-tolerant, and highly available systems on AWS.

Launched Amazon EC2 cloud servers from AMIs (Linux/Ubuntu) and configured them for specified applications.

Built Jenkins jobs for CI/CD infrastructure from GitHub repos.

Big Data Developer

SANMINA

Huntsville, AL

Oct 2015 – Dec 2016

Used the Cloudera Hadoop distribution (CDH5) for executing the respective scripts.

Used Cloudera Manager to maintain a healthy cluster.

Collected log data from various sources and integrated it into HDFS using Flume, staging the data in HDFS for further analysis.

Used HiveQL scripts to create and load data into diverse Hive tables.

Hands-on experience with job scheduling in Oozie.

Developed shell scripts, Oozie scripts, and Python scripts.

Responsible for data loading with tools such as Sqoop and Flume.

Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.

Handled importing of data from RDBMS into HDFS using Sqoop.

Supported the clusters and topics in Kafka Manager.

Worked on the Kafka cluster environment and ZooKeeper.

Knowledge of setting up a Kafka cluster.

Hadoop Developer

MATTEL

El Segundo, CA

Apr 2014 – Oct 2015

Monitored the Hadoop cluster using tools such as Ambari.

Responsible for performance optimization of clusters.

Set up Hortonworks infrastructure, from configuring clusters to node security using Kerberos.

Implemented Hadoop/Spark applications on a Kerberos-secured cluster.

Installed the Oozie workflow engine to run multiple Hive jobs.

Deep understanding of, and experience implementing, various methods to load Hive tables from HDFS and the local file system.

Migrated data from HDFS to a relational database system using Sqoop.

Bachelor of Arts in Economics

University of Nevada, Las Vegas

Las Vegas, NV

