
Data Engineer

Location:
Atlanta, GA
Posted:
March 22, 2021

Resume:

Sapoon Jyoti

*************@*****.***

404-***-****

Data Engineer

Data-driven analytics and big data engineer with 5+ years of experience.

Focused on technologies ranging from Hadoop, Kafka, and Flume to other distributed frameworks such as Spark, with additional experience in cloud platforms such as AWS.

Highly adaptable to new technologies.

Motivated and driven by the desire to produce error-free, usable applications.

Skills

APACHE - Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce

SCRIPTING - HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, LINUX

OPERATING SYSTEMS - Unix/Linux, Windows 10, Ubuntu, Apple OS

FILE FORMATS - Parquet, Avro, JSON, ORC, Text, CSV

DISTRIBUTIONS - Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)

DATA PROCESSING (COMPUTE) ENGINES - Apache Spark, Spark Streaming, Flink, Storm

DATA VISUALIZATION TOOLS - Pentaho, QlikView, Tableau, Power BI, Matplotlib

DATABASE - Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB; database & data structures

SOFTWARE - Microsoft Project, Primavera P6, VMWare, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills

Work History

07/19 - Current

Data Engineer

AT&T, Atlanta, GA

Automated the installation and configuration of Scala, Python, Hadoop, and their dependencies, including generation of the required configuration files.

Fully configured an HDFS cluster for faster and more efficient processing and analysis of the data.

Implemented a Kafka ingestion pipeline with multiple brokers, for data protection, to ingest both real-time financial data and REST API data.

The Kafka implementation in this scenario was written primarily in Python to fetch the required data from the APIs.
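A minimal sketch of such a producer, assuming the kafka-python client and placeholder endpoint, topic, and broker names, might look like this:

```python
# Hypothetical sketch of a Kafka producer that polls a REST API and
# publishes each record to a topic; the endpoint, topic, and broker
# list are placeholders, not actual production values.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get("https://api.example.com/quotes", timeout=10)
    resp.raise_for_status()
    for record in resp.json():
        producer.send("market-quotes", value=record)
    producer.flush()
    time.sleep(5)  # poll interval
```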

The Kafka consumer was also written mainly in Python, with the intent to move it to Scala later to allow further optimization with Spark.

The resulting data was stored in a MySQL database for data scientists to run their scripts on the cleaned data.
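A comparable consumer-side sketch, assuming kafka-python and the MySQL Connector/Python driver, with an invented table and schema:

```python
# Hypothetical sketch of the consumer side: read records from Kafka,
# apply light cleaning, and insert them into a MySQL table.
import json

import mysql.connector
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "market-quotes",
    bootstrap_servers=["broker1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

conn = mysql.connector.connect(
    host="db-host", user="etl", password="...", database="quotes"
)
cursor = conn.cursor()

for message in consumer:
    record = message.value
    if record.get("price") is None:  # drop incomplete records
        continue
    cursor.execute(
        "INSERT INTO quotes (symbol, price, ts) VALUES (%s, %s, %s)",
        (record["symbol"], float(record["price"]), record["ts"]),
    )
    conn.commit()
```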

Automated the whole pipeline, including initializing Kafka and Spark whenever necessary, for ease of use.

Developed a final tabular dashboard using web application frameworks such as Flask to make the analysis results easier to visualize.

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into Amazon Redshift.

Utilized AWS EMR to process big data across Hadoop clusters of virtual servers, with the data stored on Amazon Simple Storage Service (S3).
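A hedged sketch of the kind of PySpark job this describes, reading raw data from S3 on EMR, writing curated Parquet back to S3, and loading it into Redshift with a COPY; bucket names, the table, the IAM role, and the credentials are all placeholders:

```python
# Hypothetical PySpark job for EMR: read raw data from S3, aggregate it,
# write the result back to S3 as Parquet, then issue a Redshift COPY.
from pyspark.sql import SparkSession, functions as F
import psycopg2

spark = SparkSession.builder.appName("s3-to-redshift").getOrCreate()

raw = spark.read.json("s3://raw-bucket/events/")  # multi-terabyte input
daily = (raw.groupBy("event_date", "event_type")
            .agg(F.count("*").alias("event_count")))
daily.write.mode("overwrite").parquet("s3://curated-bucket/daily_events/")

# Load the curated Parquet files into Redshift via COPY.
conn = psycopg2.connect(host="redshift-host", port=5439,
                        dbname="analytics", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY daily_events
        FROM 's3://curated-bucket/daily_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS PARQUET
    """)
```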

Automated AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

Implemented the security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).

12/17 - 07/19

AWS Data Engineer

Capital One, McLean, VA

Fully configured an HDFS cluster for faster and more efficient processing and analysis of the data.

Implemented a Kafka ingestion pipeline with multiple brokers, for data protection, to ingest data from Twelve Data.

The Kafka producer in this case was written in Python to fetch the appropriate statistical data for 10 different companies.

Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS

Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.

Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS using Python and Scala.
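A minimal Python-flavoured sketch of that Kafka-to-HDFS flow (using Structured Streaming rather than the DStream API), with placeholder broker, topic, schema, and path names; the Spark Kafka connector package is assumed to be on the classpath:

```python
# Hypothetical sketch: consume a Kafka topic with Spark Structured
# Streaming and append the parsed records to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = (StructType()
          .add("symbol", StringType())
          .add("price", DoubleType())
          .add("ts", StringType()))

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "market-quotes")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

query = (stream.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/quotes/")
         .option("checkpointLocation", "hdfs:///checkpoints/quotes/")
         .outputMode("append")
         .start())
query.awaitTermination()
```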

The Kafka consumer was written in Scala to better optimize the Spark application.

Stored the data from the Kafka consumer in Hadoop so that data scientists could access it for their own use.

Developed data transformation pipelines using Spark RDDs and Spark SQL.
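An illustrative fragment of such a pipeline, with an invented input file and column names:

```python
# Hypothetical transformation pipeline: clean raw lines with the RDD API,
# then register the result as a table and aggregate it with Spark SQL.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-sql-pipeline").getOrCreate()
sc = spark.sparkContext

raw = sc.textFile("hdfs:///data/raw/transactions.csv")
rows = (raw.map(lambda line: line.split(","))
           .filter(lambda parts: len(parts) == 3)  # drop malformed rows
           .map(lambda p: Row(account=p[0], category=p[1], amount=float(p[2]))))

df = spark.createDataFrame(rows)
df.createOrReplaceTempView("transactions")

summary = spark.sql("""
    SELECT category, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY category
""")
summary.write.mode("overwrite").parquet("hdfs:///data/curated/category_totals/")
```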

Created multiple batch Spark jobs using Java

Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for the specified applications.

Developed metrics, attributes, filters, reports, dashboards, and created advanced chart types, visualizations, and complex calculations to manipulate the data.

Implemented a Hadoop cluster and different processing tools, including Spark and MapReduce.

Pushed containers into AWS ECS.

Used Scala to connect to EC2 and push files to AWS S3.

Installed, configured, and managed monitoring tools such as the ELK stack and AWS CloudWatch for resource monitoring.

09/16 - 12/17

Big Data Engineer

ESPN Inc, Bristol, CT

Implemented a Kafka ingestion method to ingest data from API-Football.

The implementation was done using multiple brokers to ensure data protection.

Designed a Spark application, using PySpark on top of the data ingestion tool in order to transform the data into the required shape.

Utilized Zookeeper and Spark interface for monitoring the proper execution of the Spark Streaming job.

Stored the data in a Hive table to enable users to query the data using traditional SQL commands.
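A small sketch of that pattern, with a made-up database and table name:

```python
# Hypothetical sketch: write a DataFrame to a Hive table and query it back.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-sink")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS sports")

matches = spark.read.parquet("hdfs:///data/curated/matches/")
matches.write.mode("overwrite").saveAsTable("sports.match_stats")

# Users can now run plain SQL against the table.
spark.sql("""
    SELECT team, AVG(goals) AS avg_goals
    FROM sports.match_stats
    GROUP BY team
""").show()
```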

Wrote shell scripts to automate data ingestion tasks.

Used Cron jobs to schedule the execution of data processing scripts.

Migrated complex MapReduce scripts to Apache Spark RDD code.
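As a hedged illustration of the shape of such a migration (not the actual scripts), a classic MapReduce word count collapses into a few RDD operations:

```python
# Hypothetical example of a MapReduce-style job rewritten with Spark RDDs:
# the map and reduce phases become flatMap/map and reduceByKey.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mr-to-rdd").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/raw/logs/")
            .flatMap(lambda line: line.split())   # "map" phase
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))     # "reduce" phase

counts.saveAsTextFile("hdfs:///data/curated/word_counts/")
```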

Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data from the HDFS.

05/15 - 09/16

Data Engineer

DXC Technology, Tysons Corner, VA

Implemented a Flume configuration to ingest data from a REST API.

Designed a Spark application, using PySpark that streamed the data from Kafka into spark datatypes.

Used Spark extensively to filter and format the data before designing the sink that stored it in Hive.

Used Hive to store the final data produced by the Spark applications, so that data scientists could fetch and process it.

Developed Spark SQL scripts for handling different data sets and verified their performance against MapReduce jobs.

Created Hive tables to store data arriving in varied formats from different APIs.

Imported data from the local file system and RDBMSs into HDFS using Sqoop, and developed a workflow in Airflow to automate the tasks of loading the data into HDFS.
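A minimal Airflow DAG sketch along those lines, written against the Airflow 2 operator layout and wrapping the Sqoop import in a BashOperator; hosts, credentials, tables, and paths are placeholders:

```python
# Hypothetical Airflow DAG: run a Sqoop import from an RDBMS into HDFS
# on a daily schedule. Connection details and paths are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sqoop_daily_import",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sqoop_import = BashOperator(
        task_id="sqoop_import_orders",
        bash_command=(
            "sqoop import "
            "--connect jdbc:mysql://db-host/sales "
            "--username etl --password-file /user/etl/.pw "
            "--table orders "
            "--target-dir /data/raw/orders/{{ ds }} "
            "--num-mappers 4"
        ),
    )
```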

Evaluated various data processing techniques available in Hadoop from various perspectives to detect aberrations in data, provide output to the BI tools, etc.

Cleaned up the input data, specified the schema, processed the records, wrote UDFs, and generated the output data using Hive.
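For the UDF step, a hedged PySpark-style sketch (a Python UDF registered for use in Hive-style SQL rather than a Java Hive UDF); the function, table, and column names are invented:

```python
# Hypothetical sketch: register a Python UDF and use it in a SQL query
# over a Hive table to normalize a messy input column.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-cleanup")
         .enableHiveSupport()
         .getOrCreate())

def normalize_region(value):
    """Trim and upper-case a free-text region field, mapping blanks to None."""
    if value is None or not value.strip():
        return None
    return value.strip().upper()

spark.udf.register("normalize_region", normalize_region, StringType())

spark.sql("""
    SELECT normalize_region(region) AS region, COUNT(*) AS n
    FROM staging.api_events
    GROUP BY normalize_region(region)
""").show()
```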

Education

Master of Science: Computer Science

Emory University - Atlanta, GA

Bachelor of Science: Computer Science

State University Of New York - Plattsburgh, New York

Certifications

Big Data 101, IBM.

Hadoop 101, IBM

Spark Fundamentals I, IBM

HIPAA certification to work with medical data.

CITI certification to work with medical data.
