
Big Data Engineer

Location: United States

Name: Pranav S

Phone: 571-***-****

Email: adnb30@r.postjobfree.com

PROFESSIONAL SUMMARY:

7+ years of technical expertise across the complete software development life cycle (SDLC), including 5+ years of Big Data development.

Hands-on experience working with Apache Spark and Hadoop ecosystem components, including MapReduce, Sqoop, Hive, Oozie, Kafka, Impala, and HBase.

Excellent knowledge of Spark's internal architecture and the various Spark APIs for creating efficient Spark applications.

Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and Python.

Strong experience working with Spark for large-scale data processing, data cleansing, de-normalization, and aggregation.

Strong experience using the Spark RDD, DataFrame, Spark SQL, and Spark Streaming APIs extensively.

Strong experience troubleshooting long-running Spark applications, designing highly fault-tolerant Spark applications, and fine-tuning Spark jobs.

Worked on building data ingestion pipelines to pull data from sources such as S3 buckets, FTP servers, and REST applications.

Strong knowledge of building data lakes on AWS using services such as S3, EMR, Athena, Redshift, Redshift Spectrum, and the Glue Metastore.

Strong experience working with file formats such as Avro, Parquet, ORC, JSON, and CSV.

Experience importing and exporting data with Sqoop between HDFS/Hive and relational database systems.

Experienced in performing data analysis using Python NumPy and Pandas.

Migrated MapReduce code into Spark transformations using PySpark and Scala.

Strong experience building data models in Hive and writing advanced Hive scripts for data analysis requirements.

Developed Spark applications in Scala to perform cleansing, transformation, and enrichment of clickstream data.

Worked extensively on Sqoop for performing both batch and incremental loads from relational databases.

Developed Scala scripts and UDFs using both the DataFrame/SQL and RDD APIs in Spark for data aggregation and queries, and wrote data back into RDBMS tables through Sqoop.

Experience working with NoSQL databases like Cassandra and HBase.

Proficient in SQL for querying, data extraction and transformation, and developing queries for a wide range of applications.

Experience working with GitHub, Jenkins and Maven.

Strong in core Java concepts, including object-oriented design (OOD) and Java components such as the Collections Framework, exception handling, and the I/O system.

Good understanding of and experience with Agile and Waterfall methodologies of the Software Development Life Cycle (SDLC).

Highly motivated self-learner with a positive attitude and a willingness to learn new concepts and accept challenges.

TOOLS AND TECHNOLOGIES:

Big Data Technologies: Spark, Kafka, Hive, HDFS, MapReduce, HBase, Impala, Hue, Cloudera Manager, Sqoop, Oozie

Programming Languages: Java, Scala, HiveQL, Python, Shell Scripting

Databases: Oracle, Teradata, MySQL, HBase, Cassandra

Build and CI/CD: Maven, Jenkins

Version Control Tools: Bitbucket, SVN, GitHub

Developer Tools: Eclipse, IntelliJ, Jupyter

Professional Experience:

Client: Bayer, St Louis, MO

Big Data Developer May 2020 - Present

Responsibilities:

Worked on building a centralized data lake on AWS using primary services like S3, EMR, Redshift, and Athena.

Worked on migrating datasets and ETL workloads from on-prem systems to AWS Cloud services.

Built a series of Spark applications and Hive scripts to produce analytical datasets needed by digital marketing teams.

Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to the cloud.

Worked extensively on fine-tuning Spark applications and providing production support for various pipelines.

Worked closely with business and data science teams and ensured all requirements were translated accurately into our data pipelines.

Worked on writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and PySpark.

Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.

Developed Spark scripts using PySpark shell commands as per requirements.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Worked on automating infrastructure setup, including launching and terminating EMR clusters.

Created Hive external tables on top of datasets loaded into AWS S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis.

Built a real-time streaming pipeline using Kafka, Spark Streaming, and Redshift; a minimal sketch of this pattern follows this section.

Worked on creating Kafka producers with the Kafka Java Producer API to connect to an external REST live-stream application and publish messages to a Kafka topic.

Environment: AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, Scala, PySpark, Java, Hive, Kafka
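
The following is a minimal, illustrative sketch of the streaming pattern described above: a Spark Structured Streaming job that reads from a Kafka topic and lands micro-batches on S3 as Parquet for downstream consumption. The topic name, broker address, and S3 paths are placeholders rather than actual production values.

    import org.apache.spark.sql.SparkSession

    object ClickstreamStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("clickstream-stream-sketch").getOrCreate()

        // Read the live event stream from Kafka (broker and topic names are placeholders).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream-events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload", "timestamp")

        // Land micro-batches as Parquet on S3 so Hive/Athena/Redshift Spectrum can query them.
        events.writeStream
          .format("parquet")
          .option("path", "s3://example-bucket/clickstream/")
          .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
          .outputMode("append")
          .start()
          .awaitTermination()
      }
    }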

Client: CITI Group, Tampa, FL

Data Engineer Nov 2018-April 2020

Responsibilities:

Involved in creating data ingestion pipelines for collecting Adobe Clickstream behavioral data from external sources like FTP servers and S3 buckets.

Involved in migrating the existing Teradata data warehouse to an AWS S3-based data lake.

Involved in migrating existing traditional ETL jobs to PySpark and Hive jobs on the new cloud data lake.

Wrote complex Spark applications to de-normalize datasets and create a unified data analytics layer for downstream teams.

Developed series of data ingestion jobs for collecting the data from multiple channels and external applications in Python.

Primarily responsible for fine-tuning long-running Spark applications, writing custom Spark UDFs, and troubleshooting failures.

Involved in building a real-time pipeline using Kafka and Spark Streaming to deliver event messages from an external REST-based application to a downstream application team.

Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.

Worked extensively on migrating on-prem workloads to the AWS Cloud.

Developed Scala scripts and UDFs using both the DataFrame/SQL and RDD APIs in Spark for data aggregation and queries, and wrote data back into RDBMS tables through Sqoop.

Implemented Spark using Python and Spark SQL for faster testing and processing of data.

Migrated Map-reduce jobs to Spark applications and integrated with Apache Phoenix and HBase.

Worked on writing RDD (Resilient Distributed Dataset) transformations and actions using PySpark.

Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena and Glue Metastore.

Used broadcast variables, efficient joins, caching, and other Spark capabilities for data processing; a minimal broadcast-join sketch follows this section.

Created a private cloud using Kubernetes that supports DEV, TEST, and PROD environments.

Implemented a testing environment for Kubernetes and administered the Kubernetes clusters. Deployed and orchestrated applications with Kubernetes.

Environment: AWS EMR, Lambda, IAM, Spark, PySpark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, MapReduce, Cloudera
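
The following is a minimal, illustrative sketch of the broadcast-join pattern mentioned above: enriching a large event dataset with a small lookup table without shuffling the large side. The dataset paths and the join column (channel_id) are assumed for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object EnrichEventsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("enrich-events-sketch").getOrCreate()

        // Large fact table of events and a small dimension table (paths are placeholders).
        val events   = spark.read.parquet("s3://example-bucket/events/")
        val channels = spark.read.parquet("s3://example-bucket/dim_channel/")

        // Broadcasting the small dimension table ships it to every executor,
        // so the large events dataset is joined without a shuffle.
        val enriched = events.join(broadcast(channels), Seq("channel_id"), "left")

        enriched.write.mode("overwrite").parquet("s3://example-bucket/events_enriched/")
        spark.stop()
      }
    }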

Client: Infoworks, Palo Alto, CA

Hadoop Developer June 2017 to Sept 2018

Responsibilities:

Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and semi-structured data, in batch and real-time streaming modes.

Applied efficient and scalable data transformations on the ingested data using the PySpark framework.

Gained good knowledge in troubleshooting and performance tuning Spark applications and Hive scripts to achieve optimal performance.

Developed custom Spark UDFs for transforming date fields and complex string columns and for encrypting PII fields; a minimal sketch follows this section.

Wrote complex Hive scripts for data analysis and for creating reports requested by business stakeholders.

Involved in creating custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HQL (HiveQL).

Used Oozie and Oozie Coordinators for automating and scheduling our data pipelines.

Worked extensively on migrating existing on-prem data pipelines to the AWS Cloud for better scalability and easier infrastructure maintenance.

Worked extensively on automating the creation and termination of EMR clusters as part of starting the data pipelines.

Worked extensively on migrating/rewriting existing Oozie jobs to AWS Simple Workflow.

Loaded the processed data into Redshift tables for allowing downstream ETL and Reporting teams to consume the processed data.

Good experience with the reporting tool Tableau for creating data sources, building reports and dashboards, and performing data analysis.

Environment: AWS Cloud, Spark, Kafka, Hive, Yarn, HBase, Jenkins, Docker, Tableau, Splunk, Cloudera
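
The following is a minimal, illustrative sketch of a PII-protecting Spark UDF of the kind mentioned above. The input path, the column names (email, ssn), and the choice of a SHA-256 one-way hash are assumptions for illustration; the actual jobs may have used a different encryption scheme.

    import java.security.MessageDigest
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    object MaskPiiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("mask-pii-sketch").getOrCreate()

        // One-way hash so PII values never leave the pipeline in clear text.
        val sha256 = udf { s: String =>
          if (s == null) null
          else MessageDigest.getInstance("SHA-256")
            .digest(s.getBytes("UTF-8"))
            .map("%02x".format(_))
            .mkString
        }

        val customers = spark.read.parquet("s3://example-bucket/customers_raw/")
        customers
          .withColumn("email", sha256(col("email")))
          .withColumn("ssn", sha256(col("ssn")))
          .write.mode("overwrite")
          .parquet("s3://example-bucket/customers_masked/")

        spark.stop()
      }
    }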

Client: AIG, India

Hadoop Developer January 2016 to April 2017

Responsibilities:

Involved in importing and exporting data between Hadoop Data Lake and Relational Systems like Oracle, MySQL using Sqoop.

Involved in developing Spark applications to perform ELT-style operations on the data.

Modified existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and the Spark SQL API.

Utilized Hive partitioning and bucketing, and performed various kinds of joins on Hive tables.

Involved in creating Hive external tables to perform ETL on data produced on a daily basis.

Validated the data being ingested into Hive for further filtering and cleansing.

Developed Sqoop jobs for performing incremental loads from RDBMS into HDFS and further applied Spark transformations

Loaded data into Hive tables from Spark using the Parquet columnar format; a minimal sketch follows this section.

Created Oozie workflows to automate and productionize the data pipelines

Migrated MapReduce code into Spark transformations using Scala.

Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

Documented operational problems by following standards and procedures, using JIRA.

Environment: Hadoop, Hive, Impala, Oracle, Spark, PigLatin, Sqoop, Oozie, Map Reduce, GIT, Confluence, Jenkins.
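
The following is a minimal, illustrative sketch of loading a staged dataset into a partitioned, Parquet-backed Hive table from Spark, as described above. The staging path, the table name (sales.orders), and the partition column (load_date) are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    object LoadOrdersToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-orders-to-hive-sketch")
          .enableHiveSupport()           // lets Spark write through the Hive metastore
          .getOrCreate()

        // Daily batch previously landed in HDFS, e.g. by a Sqoop incremental import (path is a placeholder).
        val orders = spark.read.parquet("/data/staging/orders/")

        // Append into a Hive table partitioned by load date and stored as Parquet.
        orders.write
          .mode("append")
          .format("parquet")
          .partitionBy("load_date")
          .saveAsTable("sales.orders")

        spark.stop()
      }
    }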

Client: TMW Systems, Chennai, IN

Java Developer October 2013 – December 2015

Responsibilities:

Reviewed requirements with the support group and developed an initial prototype.

Involved in the analysis, design, and development of application components using JSP and Servlet components, following J2EE design patterns.

Wrote specifications for the development.

Wrote JSPs, Servlets and deployed them on WebLogic Application server.

Implemented the Struts framework based on the Model-View-Controller (MVC) design paradigm.

Implemented the MVC architecture using Struts MVC.

Created the struts-config.xml file and defined the Action mappings.

Designed the application using Struts based on the MVC architecture, with simple JavaBeans as the Model, JSP UI components as the View, and the ActionServlet as the Controller.

Wrote Oracle PL/SQL Stored procedures, triggers, views for backend database access.

Used JSP and HTML on the front end, Servlets as front controllers, and JavaScript for client-side validations.

Participated in server-side and client-side programming.

Wrote SQL stored procedures and used JDBC to connect to the database; a minimal sketch follows this section.

Worked on triggers and stored procedures on Oracle database.

Worked on Eclipse IDE to write the code and integrate the application.

Communicated between different applications using JMS.

Extensively worked on PL/SQL, SQL.

Developed different modules using J2EE (Servlets, JSP, JDBC, JNDI).

Tested and validated the application on different testing environments.

Performed functional, integration and validation testing.

Environment: Java, J2EE, Struts, JSP, HTML, Servlets, Java Script, Rational Rose, SQL, PL-SQL, JDBC, MS Excel, UML, Apache Tomcat.
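
The following is a minimal, illustrative sketch of calling a stored procedure over JDBC, as described above. It is written in Scala for consistency with the other examples in this document; the original work was in Java, but the JDBC calls are identical. The connection URL, credentials, procedure name, and parameter values are placeholders.

    import java.sql.DriverManager

    object CallStoredProcSketch {
      def main(args: Array[String]): Unit = {
        // Connection details are placeholders, not real credentials.
        val conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@//db-host:1521/ORCL", "app_user", "app_password")
        try {
          // Hypothetical procedure that updates an order's status.
          val stmt = conn.prepareCall("{ call update_order_status(?, ?) }")
          stmt.setLong(1, 12345L)        // order id (example value)
          stmt.setString(2, "SHIPPED")   // new status (example value)
          stmt.execute()
          stmt.close()
        } finally {
          conn.close()
        }
      }
    }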

Education:

Bachelor of Technology in Electronics & Computer Engineering, Sreenidhi Institute of Science and Technology, Jawaharlal Nehru Technological University, Hyderabad, Aug 2009 - May 2013.


