Bhargav L.
AWS Sr. Data Engineer Mobile: +1-469-***-****
Email: ************@*****.*** Location: Plano, TX
LinkedIn: www.linkedin.com/in/bhargav-raju-35b703274
SUMMARY:
11+ years of total IT experience with expertise in AWS, Big Data, and Hadoop, including development and design of enterprise applications.
Extensive working experience with Hadoop ecosystem components such as HDFS, MapReduce, Hive, Sqoop, Flume, Spark, Kafka, Oozie, and Zookeeper.
Implemented performance tuning techniques for Spark SQL queries.
Strong knowledge of Hadoop HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
Strong hands-on experience publishing messages to various Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
Created Spark jobs that process source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs.
Developed Sqoop scripts to migrate data from Teradata and Oracle to the big data environment.
Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Skilled in identifying and documenting data quality issues, data anomalies, and data inconsistencies during the ETL process, and collaborating with development teams to address and resolve them.
Hands-on experience with ETL testing tools such as Informatica PowerCenter, IBM DataStage, or Talend, including the configuration and execution of ETL jobs and workflows.
Involved in loading data from AWS S3 to Snowflake and processed data for further analysis.
Developed analytical dashboards in Snowflake and shared data with downstream consumers.
Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets (a brief sketch of this pattern follows the summary).
Integrated AWS DynamoDB with AWS Lambda to store value items and back up the DynamoDB streams.
Developed a marketing cloud service on AWS and built serverless applications using AWS Lambda, S3, Redshift, and RDS.
Worked on large-scale data transfer across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
Experience in project deployment using Jenkins and in AWS services such as EC2, S3, Auto Scaling, CloudWatch, and SNS.
Worked on Elasticsearch for data search and Kibana for data sampling.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
Designed end-to-end scalable architecture to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio. Exposure to Scrum, Agile, and Waterfall methodologies.
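A minimal Python sketch of the Kinesis-to-Lambda-to-DynamoDB pattern mentioned above; the table name is hypothetical and the payload is assumed to already contain the table's key attributes.

import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events_table")  # hypothetical DynamoDB table name


def lambda_handler(event, context):
    # Kinesis delivers each record's payload base64-encoded under record["kinesis"]["data"].
    records = event.get("Records", [])
    for record in records:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item=payload)  # assumes the payload carries the table's key attributes
    return {"processed": len(records)}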
TECHNICAL SKILLS:
Programming Languages: Scala, Python, SQL
Big Data Ecosystem: Hadoop, MapReduce, Kafka, PySpark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend
Hadoop Distributions: Cloudera Enterprise, Databricks, Hortonworks
Databases: Oracle, SQL Server, PostgreSQL
Streaming Tools: Kafka, RabbitMQ, Talend
Testing: Hadoop Testing, Hive Testing
Operating Systems: Linux (Red Hat/Ubuntu/CentOS), Windows 10/8.1/7/XP
Cloud: AWS EMR, Glue, Step Functions, CFT, RDS, CloudWatch, S3, Redshift Cluster, DynamoDB
Technologies and Tools: Jenkins, Bitbucket, GitHub, Bamboo
Application Servers: Tomcat, JBoss
IDEs: PyCharm, IntelliJ
PROFESSIONAL EXPERIENCE:
Sr. Data Engineer
Client: Nike, OR Aug 2023 to Jun 2025
RESPONSIBILITIES:
Built pipelines using Azure Data Factory (ADF) and Azure Synapse Pipelines to extract, transform, and load (ETL/ELT) data from various sources (on-prem, cloud, APIs).
Used Azure Databricks, Apache Spark, or Azure Synapse Analytics (SQL & Spark pools) for big data processing.
Utilized Databricks notebooks for interactive development, data exploration, and visualization of large-scale datasets.
Worked closely with BI teams using Power BI and enabled semantic models.
Optimized data pipelines for performance and cost (compute/storage).
Communicated with stakeholders, data scientists, and analysts to understand requirements.
Designed and implemented robust, scalable data pipelines using Apache Airflow to orchestrate ETL workflows with complex dependencies and retry mechanisms (an illustrative sketch follows this section).
Scheduled and monitored DAGs to ensure timely data availability and lineage tracking across different environments.
Developed and optimized data models in Snowflake (Star & Snowflake schema) for structured analytics consumption.
Built performant SQL queries, materialized views, and managed virtual warehouses for cost-effective compute operations.
Used Snowflake’s Time Travel, Zero-Copy Cloning, and external tables for versioning, recovery, and semi-structured data ingestion.
Integrated Amazon S3 for ingestion and archival of raw/processed datasets; automated lifecycle policies for storage optimization.
Developed AWS Lambda functions in Python for serverless data transformations, event-driven ingestion, and metadata updates.
Wrote scalable ETL logic using PySpark and Spark SQL, handling billions of records efficiently with transformations and aggregations.
Tuned Spark jobs with proper partitioning, caching, and memory management for performance optimization.
Built reusable, modular Python scripts for data transformation, validation, and pipeline monitoring.
Created custom SQL scripts for complex joins, window functions, and data quality checks ensuring clean and reliable datasets.
Implemented unit tests, data validations, and schema enforcement using Pytest and custom data checkers.
Monitored job success and failure using logging, alerting, and integration with tools such as CloudWatch and Airflow SLAs.
Collaborated with data analysts, data scientists, and product teams to gather requirements and deliver high-quality data solutions.
Maintained detailed technical documentation for pipelines, DAGs, and data models to ensure team-wide transparency and maintainability.
Environment: AWS EMR, S3, Lambda, Step Functions, Snowflake, CloudWatch, Hive, PySpark, Python, Tableau, Spark SQL, Azure, Databricks, Power BI.
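An illustrative Airflow DAG sketch for the orchestration described in this section; the DAG name, task names, and retry settings are assumptions, and the callables are placeholders rather than the production logic.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # Placeholder: pull the day's raw files from S3 into a staging location.
    pass


def transform_orders(**context):
    # Placeholder: run PySpark/Spark SQL transformations on the staged data.
    pass


def load_to_snowflake(**context):
    # Placeholder: copy the curated output into a Snowflake table.
    pass


default_args = {
    "owner": "data-engineering",
    "retries": 3,                          # retry failed tasks
    "retry_delay": timedelta(minutes=10),  # back off between attempts
}

with DAG(
    dag_id="orders_daily_etl",             # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    extract >> transform >> load           # explicit dependency chain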
Sr. Data Engineer
Client: Ditech, Fort Washington, PA Dec 2018 to Apr 2023
RESPONSIBILITIES:
Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Developed a PySpark data ingestion framework to ingest source claims data into Hive tables, performing data cleansing and aggregations and applying de-dup logic to identify the latest version of each record (see the sketch at the end of this section).
Involved in creating End-to-End data pipeline within distributed environment using the Big data tools, Spark framework and Tableau for data visualization.
Developed CloudFormation templates (CFTs) for promoting infrastructure from lower to higher environments.
Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
Created a Python topology script to generate CloudFormation templates for provisioning EMR clusters in AWS.
Experience using AWS Athena, Redshift, and Glue ETL jobs.
Proficient in test data management techniques, including the creation and maintenance of test data sets and test data environments for ETL testing activities.
Strong knowledge of SQL queries and database concepts to perform data validations, data comparisons, and data reconciliations between source systems and target data warehouses or data marts.
Strong understanding of data warehousing concepts and dimensional data models (star schema, snowflake schema), as well as experience in testing ETL processes involving multiple data sources and complex data transformations.
Involved in loading data from AWS S3 to Snowflake and processed data for further analysis.
Developed analytical dashboards in Snowflake and shared data with downstream consumers.
Built data-centric queries for cost optimization in Snowflake.
Good knowledge on AWS Services like EC2, EMR, S3, Service Catalog, and Cloud Watch.
Experience in using Spark SQL to handle structured data from Hive in AWS EMR Platform.
Explored Spark to improve performance and optimize existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
Experienced in handling large datasets during ingestion using partitioning, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations.
Experienced in optimizing Hive queries and joins to handle different datasets.
Involved in creating Hive tables (Managed tables and External tables), loading and analyzing data using hive queries.
Actively involved in code review and bug fixing to improve performance.
Good experience handling data manipulation using Python scripts.
Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
Wrote unit test cases for Spark code as part of the CI/CD process.
Good knowledge of configuration management and CI/CD tools such as Bitbucket, GitHub, and Bamboo.
Environment: AWS EMR, S3, Lambda, Step Functions, CFTs, SNS, SQS, Glue, Snowflake, CloudWatch, EMRFS, Linux, Hive, PySpark, Python, Tableau, Spark SQL, Databricks.
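A minimal PySpark sketch of the de-dup pattern described in this section (keep only the latest version of each record); the source path, column names, and Hive table name are hypothetical.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_ingest").enableHiveSupport().getOrCreate()

raw = spark.read.parquet("s3://example-bucket/raw/claims/")  # hypothetical source path

# Rank the versions of each claim by update time, newest first.
w = Window.partitionBy("claim_id").orderBy(F.col("updated_at").desc())

latest = (
    raw.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)   # keep only the most recent version per claim
       .drop("rn")
)

latest.write.mode("overwrite").saveAsTable("claims_db.claims_latest")  # hypothetical Hive table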
Sr. Data Engineer
Client: Emblem Health - New York, NY. Dec 2016 to Oct 2018
RESPONSIBILITIES:
Worked with extensive data sets in Big Data to uncover patterns and problems and unleash value for the enterprise.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
Worked with internal and external data sources on improving data accuracy and coverage and generated recommendations on the process flow to accomplish the goal.
Data visualization with Pentaho, Tableau, and D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple. Big data analysis using Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, Pandas, and scikit-learn.
Worked on a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools, and back again.
Designed end-to-end scalable architecture to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (see the sketch at the end of this section).
Exposed transformed data on the Azure Databricks platform in Parquet format for efficient data storage.
Delivered denormalized data from the produced layer in the data lake to Power BI consumers for modeling and visualization.
Extracted and updated the data into HDFS using Sqoop import and export.
Utilized Ansible playbooks for code pipeline deployment.
Used Delta Lake as an open-source data storage layer that delivers reliability to data lakes.
Created a custom logging framework for ELT pipeline logging using Append Variable activities in Data Factory.
Enabled monitoring and Azure Log Analytics to alert the support team on usage and stats of the daily runs.
Took proof-of-concept ideas from the business and led, developed, and created production pipelines that deliver business value using Azure Data Factory.
Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
Implemented data quality checks in the ETL tool Talend and have good knowledge of data warehousing.
Developed Apache Spark applications for data processing from various streaming sources.
Developed Hive UDFs to incorporate external business logic into Hive scripts and developed dataset join scripts using Hive join operations.
Responsible to manage data coming from different sources through Kafka.
Environment: Azure ADF, ADLS, Spark, Kafka, Spark Streaming, Python, Databricks, Delta Lake, Synapse DB, Talend, ETL.
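A minimal Databricks/PySpark sketch of the Delta Lake streaming ingestion described in this section; the mount paths and event schema are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot_delta_stream").getOrCreate()

events = (
    spark.readStream.format("json")
         .schema("device_id STRING, temperature DOUBLE, event_time TIMESTAMP")  # assumed schema
         .load("/mnt/raw/iot/")  # hypothetical landing-zone mount
)

cleaned = events.filter(F.col("device_id").isNotNull())  # basic cleansing step

(
    cleaned.writeStream.format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/iot/")  # tracks stream progress
           .outputMode("append")
           .start("/mnt/delta/iot_events")  # hypothetical Delta table path
)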
Sr. Data Engineer
Client: Cummins - Columbus, IN Feb 2015 to Nov 2016
RESPONSIBILITIES:
Experience with professional software engineering practices and best practices for the full software development life cycle including coding standards, code reviews, source control management and build processes.
Worked on analyzing Hadoop cluster using Ambari and different big data analytic tools including Spark and Hive.
Developed frameworks for data extraction, transformation and aggregation from multiple file formats including JSON, CSV & other compressed file formats.
Worked on Teradata Parallel Transporter (TPT) to load data from databases and files to Teradata.
Wrote views based on user and/or reporting requirements.
Configured Flume source, sink and memory channel to handle streaming data from server logs and JMS sources.
Experience in working with Flume to load the log data from multiple sources directly into HDFS.
Worked in the BI team in Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
Involved in source system analysis, data analysis, data modeling to ETL (Extract, Transform and Load).
Handling structured and unstructured data and applying ETL processes.
Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes, loading data into HDFS.
Implemented a logging framework, the ELK stack (Elasticsearch, Logstash, and Kibana), on AWS.
Developed Pig UDFs to preprocess the data for analysis.
Coding complex Oracle stored procedures, functions, packages, and cursors for the client specific applications.
Experienced in using Java REST APIs to perform CRUD operations on HBase data.
Applied Hive queries to perform data analysis on HBase using the storage handler to meet business requirements.
Wrote Hive queries to aggregate data to be pushed to the HBase tables (see the sketch at the end of this section).
Created and modified shell scripts for scheduling various data cleansing scripts and ETL loading processes.
Supported and assisted QA engineers in understanding, testing, and troubleshooting.
Environment: Hadoop, Hive, Linux, MapReduce, Sqoop, Storm, HBase, Flume, Eclipse, Maven, Agile methodologies.
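A sketch of the kind of Hive aggregation described in this section, run here through Spark's Hive support for illustration; the database, table, and column names are hypothetical, and in the actual workflow the aggregate fed a table defined with the HBase storage handler.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive_aggregation").enableHiveSupport().getOrCreate()

daily_totals = spark.sql("""
    SELECT device_id,
           to_date(event_time) AS event_date,
           COUNT(*)            AS reading_count,
           AVG(sensor_value)   AS avg_value
    FROM   telemetry_db.sensor_readings   -- hypothetical Hive table
    GROUP  BY device_id, to_date(event_time)
""")

# Persist the aggregate back to Hive; the HBase-backed target would be declared separately.
daily_totals.write.mode("overwrite").saveAsTable("telemetry_db.daily_device_totals")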
Data Engineer
Client: Linq Solutions, Hyderabad India July 2013 to Jan 2015
RESPONSIBILITIES:
Process improvement: analyzed error data of recurrent programs using Python and devised a new process to reduce the turnaround time for problem resolution.
Worked on production data fixes by creating and testing PL/SQL scripts.
Experienced developing SQL procedures on complex datasets for data cleaning and report automation.
Experienced in Data Extraction/Transformation/Loading Data Conversion and Data Migration using PL/SQL Scripts and SQL Server Integration Services (SSIS).
Skilled in the integration of various data sources with multiple relational databases such as Oracle, MS SQL Server, DB2, Teradata and Flat Files into the staging area, ODS, Data Warehouse and DataMart.
Acquired, cleaned, and structured data from multiple sources and maintained databases/data systems.
Familiarity with GitHub for project management and versioning.
Strong programming skills in Python.
Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency (see the sketch at the end of this section).
Planned and conducted ETL and development tests; monitored results and took corrective action.
Troubleshoot ETL failures and performed manual loads using SQL stored procedures.
Deep dived into complex data sets to analyze trends using Linear Regression, Logistic Regression, Decision Trees.
Prepared reports using SQL and Excel to track the performance of websites and apps.
Developed Tableau workbooks to perform year over year, quarter over quarter, YTD, QTD and MTD type of analysis.
Analyzed the app’s data using Python and IBM SPSS Statistics.
Participated in building machine learning models using Python.
Environment: Python, PL/SQL scripts, Oracle Apps, Excel, IBM SPSS, Tableau, QlikView
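An illustrative Python data-quality script of the kind described in this section; the input file name and required columns are hypothetical.

import pandas as pd

REQUIRED_COLUMNS = ["customer_id", "order_date", "amount"]  # assumed schema


def validate(df: pd.DataFrame) -> dict:
    # Return simple completeness and duplication metrics for a dataset.
    present = [c for c in REQUIRED_COLUMNS if c in df.columns]
    return {
        "missing_columns": [c for c in REQUIRED_COLUMNS if c not in df.columns],
        "null_counts": df[present].isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "row_count": len(df),
    }


if __name__ == "__main__":
    sample = pd.read_csv("orders_sample.csv")  # hypothetical input file
    print(validate(sample))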
Education:
• Bachelor of Technology, Jawaharlal Nehru Technological University, Kakinada, India, with 69.66%.