Sr. Data Engineer email@example.com
7+ years of IT experience across a variety of industries, including hands-on experience in Big Data (Hadoop) and Java development.
Expertise with tools in the Hadoop ecosystem, including Spark, Hive, HDFS, MapReduce, Sqoop, Kafka, YARN, Oozie, and HBase.
Excellent knowledge of distributed components such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, and of the MapReduce programming paradigm.
Experience in designing and developing production ready data processing applications in Spark using Scala/Python.
Strong experience creating efficient Spark applications for data transformations such as data cleansing, de-normalization, various kinds of joins, and data aggregation.
Good experience fine-tuning Spark applications using techniques such as broadcasting, increasing shuffle parallelism, caching/persisting DataFrames, and sizing executors appropriately to use available cluster resources effectively.
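The broadcast technique mentioned above can be illustrated without a Spark cluster. The sketch below is a plain-Python analogue of a broadcast (map-side) join, with hypothetical table and column names: the small dimension side is replicated as an in-memory lookup map, so each large-side row joins locally and no shuffle is needed.

```python
# Conceptual sketch of a broadcast (map-side) join. In Spark this would be
# clicks_df.join(broadcast(profile_df), "user_id"); plain Python shows why
# the large side never needs to be shuffled. All names are illustrative.

def broadcast_join(large_rows, small_rows, key):
    # "Broadcast": build an in-memory lookup map of the small side once.
    lookup = {row[key]: row for row in small_rows}
    # Each large-side row joins via a local hash lookup -- no shuffle.
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}

clicks = [{"user_id": 1, "page": "/home"}, {"user_id": 2, "page": "/cart"}]
profiles = [{"user_id": 1, "segment": "gold"}]

joined = list(broadcast_join(clicks, profiles, "user_id"))
```

This is an inner join; rows on the large side without a matching profile are dropped, which mirrors Spark's default join behavior.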
Strong experience automating data engineering pipelines using proper standards and best practices (appropriate partitioning, suitable file formats, incremental loads that maintain previous state, etc.).
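The "incremental loads by maintaining previous state" pattern above can be sketched as follows, under the assumption that each source row carries a monotonically increasing `updated_at` value: the pipeline persists the highest value it has seen (the watermark) and on the next run loads only rows past it, instead of reprocessing the full table.

```python
# Minimal sketch of a watermark-based incremental load.
# Field and variable names are illustrative, not from a real pipeline.

def incremental_load(rows, state):
    watermark = state.get("last_updated_at", 0)
    # Load only rows newer than the previously saved watermark.
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if new_rows:
        # Persist the new high-water mark for the next run.
        state["last_updated_at"] = max(r["updated_at"] for r in new_rows)
    return new_rows, state

source = [{"id": "a", "updated_at": 1}, {"id": "b", "updated_at": 2}]
state = {}

first, state = incremental_load(source, state)    # initial run: full load
source.append({"id": "c", "updated_at": 3})
second, state = incremental_load(source, state)   # next run: delta only
```

In a real pipeline the state would live in a durable store (a file on S3/HDFS, a metadata table) rather than an in-memory dict.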
Good knowledge of productionizing machine learning pipelines (featurization, learning, scoring, evaluation), primarily using Spark ML libraries.
Good exposure to the Agile software development process.
Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
Strong experience with Hadoop distributions and cloud platforms such as Cloudera, Hortonworks, AWS, and Azure Databricks.
Good understanding of NoSQL databases and hands-on work experience in writing applications on NoSQL databases like HBase, Cassandra and MongoDB.
Experienced in writing complex MapReduce programs that work with different file formats such as Text, SequenceFile, XML, Parquet, and Avro.
Experience using the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
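The DAG-of-actions idea behind an Oozie workflow can be sketched as a dependency-driven run order over hypothetical actions: each action starts only after everything upstream of it has completed.

```python
# Tiny sketch of executing workflow actions as a DAG, in the spirit of an
# Oozie workflow. Action names are illustrative.

def run_dag(deps):
    done, order = set(), []
    while len(done) < len(deps):
        for action, upstream in deps.items():
            # Run an action once all of its upstream dependencies are done.
            if action not in done and all(u in done for u in upstream):
                order.append(action)   # "run" the action
                done.add(action)
    return order

# sqoop-import has no dependencies; hive-agg waits on the import;
# export waits on the aggregation.
workflow = {
    "sqoop-import": [],
    "hive-agg": ["sqoop-import"],
    "export": ["hive-agg"],
}
order = run_dag(workflow)
```

Oozie expresses the same structure declaratively in workflow XML, with fork/join and decision nodes for control flow; this loop only illustrates the scheduling semantics.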
Experience in migrating the data using Sqoop from HDFS to Relational Database System and vice-versa.
Extensive experience importing and exporting data using streaming data platforms like Flume and Kafka.
Very good experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
Excellent Java development skills using J2EE, J2SE, Servlets, JSP, EJB, JDBC, SOAP and RESTful web services.
Experience in database design using PL/SQL to write Stored Procedures, Functions, Triggers and strong experience in writing complex queries for Oracle.
Experienced in working with Amazon Web Services (AWS), including S3, EMR, Redshift, Athena, and the Glue metastore.
Strong experience in Object-Oriented Design, Analysis, Development, Testing and Maintenance.
Experienced in using agile approaches, including Extreme Programming, Test-Driven Development and Agile Scrum.
Worked in large and small teams for systems requirement, design & development.
Key participant in all phases of the software development life cycle, including analysis, design, development, integration, implementation, debugging, and testing of software applications in client-server environments. Experienced in using IDEs such as Eclipse and IntelliJ and repositories such as SVN and Git.
Experience using build tools such as SBT and Maven.
Bachelor of Technology in Computer Science (2012)
Languages: Python, Java, Scala
Big Data Technologies: Spark, HDFS, MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Kafka, Impala
Platforms/Distributions: Cloudera, Hortonworks, Azure Databricks, AWS EMR
NoSQL Databases: HBase, DynamoDB
Version Control: GitHub, Bitbucket, CVS, SVN
Build Tools: Ant, Maven, Gradle
AWS Services: S3, EMR, Redshift, Athena, Glue, Lambda
Client: JPMC, Columbus, OH Jan 2021 – Present
Ingested user behavioral data from external servers such as FTP servers and S3 buckets on a daily basis using custom input adapters.
Created Sqoop scripts to import/export user profile data from RDBMS to S3 Data Lake.
Developed Spark applications in Scala to perform various enrichments of user behavioral (clickstream) data merged with user profile data.
Involved in data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for downstream model learning and reporting.
Utilized the Spark Scala API to implement batch processing jobs.
Troubleshot Spark applications to improve fault tolerance.
Fine-tuned Spark applications/jobs to improve efficiency and overall pipeline processing time.
Created Kafka producers using the Producer API to send live-stream data to various Kafka topics.
Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.
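The consume-process-sink loop behind the streaming jobs described above can be sketched in plain Python: events arrive in micro-batches (lists standing in for Kafka messages), each batch is aggregated, and results are upserted into a key-value store standing in for HBase. All names are illustrative.

```python
from collections import Counter

def process_batches(batches, store):
    # One iteration per micro-batch, as in Spark Streaming's DStream model.
    for batch in batches:
        counts = Counter(event["event_type"] for event in batch)
        # Upsert the per-batch aggregates into the sink table.
        for key, n in counts.items():
            store[key] = store.get(key, 0) + n
    return store

batches = [
    [{"event_type": "click"}, {"event_type": "view"}],
    [{"event_type": "click"}],
]
store = process_batches(batches, {})
```

In the real pipeline the batch source is a Kafka direct stream and the sink writes use the HBase client; the structure of the loop is the same.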
Utilized Spark's in-memory capabilities to handle large datasets.
Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
Experienced in working with EMR cluster and S3 in AWS cloud.
Created Hive tables and loaded and analyzed data using Hive scripts. Implemented partitioning, dynamic partitions, and bucketing in Hive.
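Why the Hive partitioning mentioned above pays off can be sketched as follows: data is laid out by partition key (e.g. `dt=...` directories in HDFS/S3), so a query filtered on that key scans only the matching partition instead of the full table. An in-memory dict stands in for the partitioned directory layout; all names are illustrative.

```python
# Partition-pruning sketch: each dict key plays the role of a
# dt=YYYY-MM-DD partition directory under the table's storage location.

table = {
    "dt=2021-01-01": [{"user": "a"}, {"user": "b"}],
    "dt=2021-01-02": [{"user": "c"}],
}
scanned = []

def query(table, dt):
    partition = f"dt={dt}"
    scanned.append(partition)           # record which partitions were read
    return table.get(partition, [])     # prune: touch only one partition

rows = query(table, "2021-01-02")
```

Bucketing refines this further by hashing rows within a partition into a fixed number of files, which helps joins and sampling.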
Involved in continuous Integration of application using Jenkins.
Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, MapReduce.
Client: Aetna, Hartford, CT
Data Engineer November 2019 – December 2021
Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift and Athena.
Worked on migrating datasets and ETL workloads from On-prem to AWS Cloud services.
Built series of Spark Applications and Hive scripts to produce various analytical datasets needed for digital marketing teams.
Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to cloud.
Worked extensively on fine-tuning Spark applications and providing production support for various pipelines running in production.
Worked closely with business teams and data science teams and ensured all the requirements are translated accurately into our data pipelines.
Worked on full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.
Worked on automating infrastructure setup, including launching and terminating EMR clusters.
Created Hive external tables on top of datasets loaded in S3 buckets and created various hive scripts to produce series of aggregated datasets for downstream analysis.
Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.
Worked on creating Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and produce messages to Kafka topics.
Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka
Bank of the West, San Francisco, CA Jan 2019 – Oct 2019
Involved in importing and exporting data between Hadoop Data Lake and Relational Systems like Oracle, MySQL using Sqoop.
Involved in developing Spark applications to perform ELT-style operations on the data.
Migrated existing MapReduce jobs to Spark transformations and actions utilizing Spark RDDs, DataFrames, and the Spark SQL API.
Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.
Involved in creating Hive external tables to perform ETL on data that is produced on a daily basis.
Validated the data being ingested into Hive for further filtering and cleansing.
Developed Sqoop jobs to perform incremental loads from RDBMS into HDFS and further applied Spark transformations.
Loaded data into Hive tables from Spark using the Parquet columnar format.
Created Oozie workflows to automate and productionize the data pipelines.
Migrated MapReduce code to Spark transformations using Spark and Scala.
Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
Documented operational problems by following standards and procedures using JIRA.
Environment: Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, Map Reduce, GIT, Confluence, Jenkins.
SunTrust, Atlanta, GA (Offshore) July 2016 – Dec 2018
Big Data/Hadoop Developer
Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
Loaded data into Spark RDDs and performed in-memory data computation to generate output as per the requirements.
Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
Developed Spark jobs, Hive jobs to summarize and transform data.
Worked on performance tuning of Spark applications to reduce job execution times.
Tuned Spark job performance by changing configuration properties and using broadcast variables.
Streamed data in real time using Spark with Kafka; responsible for handling streaming data from web server console logs.
Worked on different file formats such as text files, Avro, Parquet, JSON, XML, and flat files using MapReduce programs.
Developed daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
Wrote Pig Scripts to generate transformations and performed ETL procedures on the data in HDFS.
Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
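The aggregation-to-MapReduce translation referenced above can be sketched in plain Python: the map phase emits (key, value) pairs, the shuffle groups values by key, and the reduce phase aggregates each group. Record fields are illustrative.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair per input record.
    for rec in records:
        yield rec["dept"], rec["amount"]

def shuffle(pairs):
    # Shuffle: group all values by key (what the framework does between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's group -- here, a GROUP BY ... SUM.
    return {key: sum(values) for key, values in groups.items()}

records = [
    {"dept": "sales", "amount": 10},
    {"dept": "hr", "amount": 5},
    {"dept": "sales", "amount": 7},
]
totals = reduce_phase(shuffle(map_phase(records)))
```

Seeing a Hive GROUP BY this way makes its cost model clear: the shuffle is the expensive step, which is why combiners and map-side aggregation help.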
Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Extensively used HiveQL queries to query data in Hive tables and loaded data into HBase tables.
Extensively worked with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and worked on optimization of Hive queries.
Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
Assisted analytics team by writing Pig and Hive scripts to perform further detailed analysis of the data.
Designed Oozie workflows for job scheduling and batch processing.
Environment: Java, Scala, Apache Spark, MySQL, CDH, IntelliJ IDEA, Hive, HDFS, YARN, Map Reduce, Sqoop, PIG, Flume, Unix Shell Scripting, Python, Apache Kafka
IMI Mobile, Hyderabad Jan 2015 – July 2016
Worked on developing the application involving Spring MVC implementations and RESTful web services.
Developed code using Core Java to implement technical enhancements following Java standards.
Worked with Swing and RCP using Oracle ADF to develop a search application as part of a migration project.
Implemented Hibernate utility classes, session factory methods, and different annotations to work with back-end database tables.
Implemented Ajax calls using JSF-Ajax integration and implemented cross-domain calls using jQuery Ajax methods.
Implemented object-relational mapping in the persistence layer using the Hibernate framework in conjunction with Spring functionality.
Used JPA (Java Persistence API) with Hibernate as the persistence provider for object-relational mapping.
Used JDBC and Hibernate for persisting data to different relational databases.
Developed and implemented a Swing-, Spring-, and J2EE-based MVC (Model-View-Controller) framework for the application.
Implemented application-level persistence using Hibernate and Spring.
Integrated Data Warehouse (DW) data from different sources in different formats (PDF, TIFF, JPEG, web crawl, and RDBMS data from MySQL, Oracle, SQL Server, etc.).
Used XML and JSON for transferring/retrieving data between different applications.
Wrote complex PL/SQL queries using joins, stored procedures, functions, triggers, cursors, and indexes in the data access layer.
Developed back-end interfaces using embedded SQL, PL/SQL packages, stored procedures, functions, exception handling, and triggers.
Used Log4j to capture logs, including runtime exceptions, and for logging info.
Used Ant as the build tool and developed build files for compiling the code and creating WAR files.
Used Tortoise SVN for source control and version management.