
Data Engineer Senior

Location:
Clinton, CT
Posted:
February 04, 2025


Resume:

Aniketh Moparthi

Senior Data Engineer

New Haven, Connecticut +1-475-***-****

***********@*****.*** LinkedIn.com

With over 10 years of experience in the Software Development Life Cycle (SDLC), I possess a robust skill set in cloud platforms including AWS, Azure, and GCP, with hands-on expertise in Databricks and big data technologies such as Hadoop, Spark, Kafka, and Hive. I am proficient in Java, Python, SQL, and Scala, with significant experience in ETL processes, data warehousing, and NoSQL databases like MongoDB and Cassandra. My technical acumen is complemented by strong interpersonal skills, adaptability, and a proven ability to work independently or collaboratively in dynamic environments.

Work Experience

Senior Data Engineer May 2022 - Present

KeyBank Cleveland, Ohio

● Designing the business requirement collection approach based on the project scope and SDLC methodology.

● Involved in transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.

● Involved in the design, development, configuration, deployment, and maintenance of multiple data pipelines.

● Pulled data from the data lake raw layer and reshaped it using various RDD transformations.

● Developed an ingestion framework to extract, parse, and transform REST API data using Spark RDD and DataFrame APIs to ingest data into the data lake.

● Implemented data quality checks, sanity checks, and data parsing processes using Spark RDD and DataFrame APIs.

● Designed, implemented, and automated a process to extract files from HDFS/Box/SFTP locations and sync them to S3 for downstream consumption on a daily basis.

● Performed data profiling and wrangling of XML, web feeds, and JSON, along with file handling, using Python, Unix, and SQL.

● Authored Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and other cleaning and conforming tasks.

● Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

● Involved in converting MapReduce programs into Spark transformations using the Spark Python DataFrame APIs.

● Worked on performance tuning, code optimization, and testing of long-running Spark applications.

● Created a Kafka consumer with Spark Streaming APIs to source streaming data and write it to S3 (see the sketch after this role's bullet list).

● Collected data using Spark Streaming from an AWS S3 bucket in near real time and performed the necessary transformations and aggregations on the fly to build the common learner data model and persist the data.

● Installed Ranger in all environments as a second level of security in Hive while consuming the data.

● Integrated and automated multiple ETL processes to run on a schedule using Airflow.

● Developed an AWS Lambda application to parse and flatten JSON data as it lands on S3, and configured the Lambda trigger on the S3 buckets.

● Developed aggregated ETL processes to refresh data on a weekly basis and write it to Redshift for Tableau reporting.

● Designed and implemented Sqoop incremental jobs to read data from SQL Server and load it into HDFS and Hive.

● Worked with data analysts and stakeholders, conducted root cause analysis, and resolved production problems and data issues.

● Environment: AWS, EC2, S3, EMR, Lambda, CloudWatch, Auto Scaling, Redshift, Jenkins, Spark, Hive, Athena, Sqoop, Airflow, Spark Streaming, Hue, Scala, Python, Git, Microservices.
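
A minimal PySpark Structured Streaming sketch of the kind of Kafka-to-S3 consumer described in this role; the broker address, topic name, event schema, and bucket paths are illustrative assumptions rather than details taken from the resume.

# Requires the spark-sql-kafka connector on the classpath; all names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

# Assumed event schema; the real payload layout is not specified in the resume.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "learner-events")               # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka value bytes into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Continuously land the parsed events in S3 as Parquet for downstream jobs.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/raw/learner_events/")                 # hypothetical bucket
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/learner_events/")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()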

Senior Data Engineer Oct 2019 - May 2022

Humana Louisville, Kentucky

● Designed, built, and maintained scalable, reusable big data frameworks to support data foundation layers for analytical consumption.

● Designed and implemented data processing pipelines for supporting large-scale data analytics using Azure Data Lake and Databricks.

● Developed pipelines using Azure Data Factory to move batch data from Azure Blob Storage to Azure Cosmos DB, exposing the data to downstream applications.

● Developed and maintained ETL processes using Azure Data Factory and Azure Databricks to extract, transform, and load data from various sources into the Azure data lake.

● Worked on creating data quality checks and monitoring processes to ensure data accuracy and consistency.

● Optimized data processing pipelines for performance and scalability by tuning SQL queries and optimizing Databricks cluster configurations.

● Built and maintained real-time data processing pipelines using Azure Databricks to support streaming data ingestion and processing into the data lake.

● Implemented Delta Lake on top of Azure Data Lake for incremental pipelines, optimizing overall ETL processing time and unifying streaming and batch data processing (see the sketch after this role's bullet list).

● Handled importing of data from various data sources, performed transformations using Hive, and loaded the data into the data lake.

● Experienced in handling large datasets using partitioning and Spark in-memory capabilities.

● Processed data stored in the data lake, created external tables using Hive, and developed scripts to ingest and repair tables that can be reused across the project.

● Developed data flows and processes for data processing using SQL, Spark SQL, and DataFrames.

● Designed and developed MapReduce (Hive) programs to analyze and evaluate multiple solutions, considering cost factors across the business as well as operational impact.

● Involved in planning process of iterations under the Agile Scrum methodology.

● Developed Airflow DAGs in Python and orchestrated multiple Spark ETL pipelines to run on Azure HDInsight and Databricks.

● Worked on scheduling Spark jobs using Databricks Workflows and Azure Data Factory.

● Developed CI/CD pipelines using GitHub Actions for systematic automated deployments.

● Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Swamp, Azure, Azure HDInsight, Spark SQL, Teradata, Spark Streaming, Hive, Scala, Pig, Azure Databricks, Azure Data Storage, Azure Data Lake, Azure SQL, GitHub.
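
A minimal sketch of the Delta Lake incremental (upsert) pattern mentioned in this role, assuming a Databricks/PySpark environment with the delta package available; the ADLS paths and the member_id merge key are hypothetical, not taken from the resume.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-incremental-sketch").getOrCreate()

# Incremental batch landed by an upstream pipeline (path is assumed).
updates = spark.read.parquet(
    "abfss://raw@examplelake.dfs.core.windows.net/members/incremental/")

# Curated Delta table in the lake (path is assumed).
target = DeltaTable.forPath(
    spark, "abfss://curated@examplelake.dfs.core.windows.net/members/")

# Upsert the increment into the curated table on a business key,
# so the same table serves both batch and streaming refreshes.
(target.alias("t")
 .merge(updates.alias("s"), "t.member_id = s.member_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())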

Data Engineer Mar 2017 - Sep 2019

Liberty Mutual Insurance - Boston

● Experience in big data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.

● Installed Ranger in all environments as a second level of security in the Kafka broker.

● Made recommendations for continuous improvement of the data processing environment; conducted performance analysis and optimized data processes.

● Designed and implemented multiple ETL solutions across various data sources using extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools. Performed data profiling and wrangling of XML, web feeds, and files using Python, Unix, and SQL.

● Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.

● Loaded Salesforce data every 15 minutes on an incremental basis into the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts (see the sketch after this role's bullet list).

● Used REST APIs with Python to ingest data from other sites into BigQuery.

● Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.

● Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 for generating interactive reports.

● Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

● Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the data platform.

● Installed and configured Pig and wrote Pig Latin scripts.

● Designed and implemented a MapReduce-based large-scale parallel relation-learning system. Set up and benchmarked Hadoop/HBase clusters for internal use.

● Performed data validation and cleansing of staged input records before loading them into the data warehouse.

● Automated the process of extracting various files, such as flat and Excel files, from sources like FTP and SFTP (Secure FTP).

● Environment: Scala, GCP, BigQuery, Dataproc, HDFS, YARN, MapReduce, Hive, Sqoop, Spark SQL, Spark Streaming.
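
A minimal sketch of an incremental GCS-to-BigQuery append of the sort described in this role, using the google-cloud-bigquery client; the project, dataset, table, and bucket names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# Append the latest 15-minute increment staged in GCS to the raw-layer table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # incremental append
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/salesforce/accounts/*.parquet",   # hypothetical staging path
    "example-project.raw.salesforce_accounts",             # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes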

Data Engineer Jun 2014 - Dec 2016

Geojit Financial Services India

● Responsible for building scalable distributed data solutions using Hadoop.

● Demonstrated a strong comprehension of project scope, data extraction, design of dependent and profile variables, logic and design of data cleaning, exploratory data analysis and statistical methods.

● Used Spark Streaming APIs to perform the necessary transformations for building the common learner data model, which gets data from Kafka in near real time and persists it into Hive.

● Developed Spark scripts using Python as per the requirements.

● Developed a real-time data pipeline using Spark to ingest customer events/activity data from Kafka into Hive and Cassandra.

● Performed Spark job optimization and performance tuning to improve running time and resource usage.

● Worked on reading and writing multiple data formats like JSON, Avro, Parquet, and ORC on HDFS using PySpark.

● Designed, developed, and maintained data integration in Hadoop and RDBMS environments with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.

● Involved in recovery of Hadoop clusters and worked on a cluster of 310 nodes.

● Worked on creating Hive tables, loading, and analyzing data using Hive queries.

● Experience in providing application support for Jenkins.

● Developed a data pipeline with AWS to extract the data from weblogs and store it in HDFS.

● Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.

● Used reporting tools like Tableau and Power BI to connect with Hive for generating daily reports of data.

● Environment: HDFS, SSIS, Hadoop, Hive, HBase, MapReduce, Spark, Sqoop, Pandas, MySQL, SQL Server, Teradata, Java, Unix, Python, Tableau, Git, Linux/Unix.

Core Skills

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Airflow, Kafka, Snowflake
Languages: PL/SQL, SQL, Python, Scala, Java
Databases: MySQL, SQL Server, Oracle, MS Access
NoSQL Databases: Cassandra, HBase, DynamoDB
Workflow mgmt. tools: Apache Airflow, Oozie, Autosys workflows
Visualization & ETL tools: Tableau, Power BI, Informatica, Talend
Cloud Technologies: AWS, Databricks, Azure
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, Bitbucket

Education

Bachelor of Computer Science, Jawaharlal Nehru Technological University, Hyderabad, 2014


