PROFESSIONAL EXPERIENCE
AWS Data Engineer
PwC, Birmingham, Alabama, USA July 2024 - Present
PricewaterhouseCoopers, also known as PwC, is a multinational professional services network. Involved in designing, developing, and maintaining scalable and efficient data pipelines that process large volumes of data. Worked with technologies like Apache Spark, Hadoop, and Amazon Web Services (AWS) tools such as S3, Redshift, and EMR.
Environment: AWS, CI/CD, Cluster, Docker, DynamoDB, Elasticsearch, EMR, ETL, Jenkins, Kafka, Kubernetes, Data Lake, Lambda, PL/SQL, S3, SAS, Snowflake, Spark, Spark Streaming, SQL, Sqoop, SSIS
Key Responsibilities:
Designed SSIS control flow tasks for orchestrating the sequence and logic of ETL processes, such as conditional branching, looping, and error handling. Implemented batch processing of streaming data using Spark Streaming.
Wrote AWS Lambda functions in Spark with cross-functional dependencies, generating custom libraries for delivering the Lambda functions in the cloud. Performed raw data ingestion, which triggered a Lambda function and loaded refined data into ADLS. Developed a CI/CD system with Jenkins on a Kubernetes environment, utilizing Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy.
Monitored system metrics and logs for problems when adding, removing, or updating Hadoop cluster nodes.
Led requirement gathering, business analysis, and technical design for Hadoop and Big Data projects.
Configured Spark Streaming to consume ongoing data from Kafka and store the stream data in DBFS.
Involved in the entire project lifecycle, including design, development, deployment, testing, implementation, and support. Worked on SQL and PL/SQL for backend data transactions and validations.
Exported the analyzed data to relational databases using Sqoop to generate reports for the BI team.
Optimized batch and streaming migration processes by leveraging Dataflow and Pub/Sub, enabling near real-time synchronization of legacy systems with cloud infrastructure. Created data frames in Spark SQL from data in HDFS, performed transformations, analyzed the data, and stored the results in HDFS.
Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark for faster data processing.
Used Hadoop on a cloud service (Qubole) to process data in AWS S3 buckets; imported data from AWS S3 into Spark RDDs and performed actions/transformations on them. Developed PySpark-based pipelines using Spark data frame operations to load data to the EDL, using EMR for job execution and AWS S3 as the storage layer.
Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in AWS. Implemented a Continuous Delivery pipeline with Docker and GitHub.
Worked with Lambda functions to load data into Redshift on arrival of CSV files in an S3 bucket. Performed data analysis, migration, cleansing, transformation, integration, import, and export through Python.
Integrated AWS SQS with ETL workflows to manage decoupled, event-driven processing of messages between microservices and data ingestion layers.
Responsible for data services and data movement infrastructure, with strong experience in ETL concepts, building ETL solutions, and data modelling. Architected several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
Integrated RESTful APIs to ingest data from external sources into cloud data pipelines, enabling real-time updates and ensuring data consistency across distributed systems.
Hands-on experience architecting ETL transformation layers and writing Spark jobs for data processing.
Involved in developing Hive DDLs to create, alter, and drop Hive tables.
Worked on different RDDs to transform data coming from various data sources into the required formats.
Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
Worked on container orchestration tools such as Docker Swarm, Mesos, and Kubernetes. Documented and standardized Elasticsearch usage patterns, improving team onboarding speed and reducing query-related incidents by 40%.
Worked with Terraform templates for AWS to maintain the infrastructure as code.
Developed metrics based on SAS scripts on a legacy system, migrating the metrics to Snowflake (AWS S3).
Established a Data Catalog and shop-for-data feature in the Data Governance tool. Utilized AWS EMR to efficiently transport data across databases and data stores inside AWS, including Amazon S3 and Amazon DynamoDB.
GCP Data Engineer
(Ivy Comptech) Entain, Hyderabad, India Aug 2021 - Jul 2023
Entain plc, formerly GVC Holdings, is an international sports betting and gambling company. Implemented the data curation and metadata management. It includes ensuring the data is properly curated and managed, including the use of metadata to maintain data quality and governance.
Environment: Airflow, Apache, Azure, BigQuery, Cassandra, Data Factory, ELT, EMR, ETL, GCP, HDFS, Hive, JS, PaaS, PySpark, Python, S3, SDK, Spark, SQL, Sqoop, Tableau, VPC
Key Responsibilities:
Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
Analyzed and developed a modern data solution with an Azure PaaS service to enable data visualization. Understood the application's current production state and the impact of the new installation on existing business processes.
Created reusable views and data marts in BigQuery to power Data Studio reports with consistent metrics and definitions.
Designed and deployed scalable ETL pipelines using Google Cloud Dataproc, integrating PySpark and Hive to process over 5TB of raw data daily, reducing data transformation time by 40%.
Achieved 70% faster EMR cluster launch and configuration, optimized Hadoop job processing by 60%, improved system stability, and utilized Boto3 for seamless file writing to S3 buckets. Experienced in GCP features including Google Compute Engine, Google Storage, VPC, Cloud Load Balancing, and IAM.
Collaborated with product and analytics teams to design experiment-ready pipelines supporting A/B testing across sports betting and gaming products.
Developed an end-to-end solution that involved ingesting sales data from multiple sources, transforming and aggregating it using Azure Databricks, and visualizing insights through Tableau dashboards. Leveraged Cloud Shell Editor for real-time troubleshooting of data jobs, improving developer velocity and incident response.
Developed Python Spark modules for data ingestion and analytics, loading from Parquet, Avro, and JSON data and from database tables. Strong experience in creating interactive and visually appealing dashboards and reports using Tableau, enabling stakeholders to gain insights from complex datasets.
Demonstrated skill in parameterizing dynamic SQL to prevent SQL injection vulnerabilities and ensure data security.
Involved in monitoring and scheduling the pipelines using Triggers in Azure Data Factory.
Designed and implemented a Lakehouse architecture using Delta Lake on Databricks, integrating ADF and Azure SQL for a scalable and governed analytics solution.
Designed and optimized ELT pipelines with Google BigQuery, handling datasets over 10TB with partitioning and clustering for performance. Automated deployment and management of GCP resources using Google Cloud SDK, streamlining infrastructure provisioning for data pipelines across Dataproc, BigQuery, and GCS.
Designed Cassandra schemas for time-series IoT data (500K writes/sec).
Used Sqoop import/export to ingest raw data into Google Cloud Storage by spinning up a Cloud Dataproc cluster.
Used Google Cloud Dataflow with the Python SDK to deploy streaming jobs in GCP, as well as batch jobs for custom cleaning of text and JSON files, writing the results to BigQuery. Involved in setting up the Apache Airflow service in GCP.
Data Engineer
Manipal Hospitals, Bangalore, India May 2019 - Jul 2021
Manipal Hospitals is a multi-specialty hospital chain in India that provides healthcare to both Indian and international patients. Designed the scalable and secure data storage and processing systems like data lakes, data warehouses. Designed, developed, and maintained ETL processes to support data integration from various sources like ERP systems, IoT sensors, and lab equipment.
Environment: Airflow, Apache, API, AWS, Cassandra, CI/CD, Cluster, Cosmos DB, Data Factory, DynamoDB, Elasticsearch, ETL, Flume, GCP, Git, HDFS, HDInsight, Hive, Jenkins, Kafka, Kubernetes, Data Lake, Lambda, Pig, Power BI, Python, S3, SAS, Scala, Spark, Spark SQL, SQL
Key Responsibilities:
Utilized Power BI Dataflows to streamline data preparation processes, enhancing the efficiency of ETL operations and ensuring consistent data across reports. Created a security framework using AWS Lambda and DynamoDB to provide fine-grained access control for objects in AWS S3.
Leveraged Apache Flume for log aggregation by streaming application logs directly from servers into HDFS and the Hadoop ecosystem. Analyzed data using the Hadoop components Hive and Pig.
Worked on partitioning of Kafka messages and setting up replication factors in the Kafka cluster.
Utilized Scala and Spark-SQL / Streaming to create Spark code for accelerated data processing.
Designed the business requirement collection approach based on the project scope and SDLC methodology.
Involved in loading data into Cassandra NoSQL Database.
Built Jenkins jobs for CI/CD infrastructure for GitHub repos.
Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow. Integrated Kubernetes with cloud-native services such as AWS EKS and GCP GKE to leverage additional scalability and managed services.
Designed and implemented Elasticsearch index schemas to support scalable, high-performance search and analytics over structured and unstructured data. Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.
Developed metrics based on SAS scripts on a legacy system, migrating the metrics to Snowflake (AWS).
Used Python machine learning techniques to predict user order amounts for certain goods, with automated recommendations delivered via Kinesis Firehose and an S3 data lake.
Hemanth Yadav
Data Engineer
Atlanta, GA, USA
*****************@*****.***
ABOUT ME
Proactive Data Engineer with 5+ years of experience designing, building, and optimizing scalable data pipelines and architectures. Proven ability to lead end-to-end data solutions across cloud platforms (AWS, Azure, GCP), big data ecosystems (Spark, Hadoop), and modern data warehouses (Snowflake, Redshift).
EDUCATION
Auburn University at Montgomery, AL, USA
Master's in Management Information Systems (Aug 2023 - May 2025)
PROFESSIONAL SUMMARY
Data Engineer with 5+ years of experience architecting and implementing end-to-end data pipelines, ETL processes, and analytics solutions on cloud platforms like AWS and on-premises Hadoop.
Experience in evaluation, design, development, and deployment of additional technologies and automation for managed services on S3, Lambda, Athena, EMR, Kinesis, SQS, SNS, CloudWatch, Data Pipeline, Redshift, DynamoDB, AWS Glue, Aurora DB, RDS, and EC2. Designed, built, and managed ELT data pipelines leveraging Airflow, Python, and GCP solutions.
Proficient in building data pipelines and data loading using Azure Databricks and Azure Data Warehouse to control accessibility to the database. Experienced in configuring and administering Hadoop clusters using major Hadoop distributions like Apache Hadoop and Cloudera.
Working knowledge of HDFS, Kafka, MapReduce, Spark, Pig, Hive, Sqoop, HBase, Flume, and Apache ZooKeeper as tools for designing and deploying end-to-end big data ecosystems. Highly skilled in using visualization tools like Tableau, Matplotlib, and ggplot2 for creating dashboards.
Hands-on experience with Spark, Databricks, and Delta Lake.
Expertise in developing production-ready Spark applications utilizing the Spark Core, DataFrames, Spark SQL, Spark ML, and Spark Streaming APIs.
Experience in Windows Azure services like PaaS and IaaS, and worked on storage like Blob (Page and Block) and SQL Azure. Well experienced in deployment and configuration management and virtualization.
Involved in all phases of the Software Development Life Cycle (SDLC) in large-scale enterprise software using Object-Oriented Analysis and Design.
Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Azure Data Factory, Data Lake Analytics, and Stream Analytics). Hands-on experience in implementing, building, and deploying CI/CD pipelines, managing projects that often include tracking multiple deployments across pipeline stages (Dev, Test/QA, Staging, and Production).
Designed and developed ETL jobs for large-scale data ingestion, transformation, cleansing, and migration between a variety of sources and sinks using PySpark, Scala, SQL, Kafka, and Airflow. Experience in migrating on-premises systems to Windows Azure using Azure Site Recovery and Azure Backup.
Proficient in utilizing Kubernetes and Docker for designing and implementing data pipelines. Expert in designing parallel jobs using various stages like Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
RELEVANT SKILLS
Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, ZooKeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon
NoSQL Databases: Postgres, HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, and Apache
Languages: Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, HiveQL, Unix Shell Scripting
Source Code Control: GitHub, CVS, SVN, ClearCase
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP
Databases: Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2
DB Languages: MySQL, PL/SQL, PostgreSQL & Oracle
Build Tools: Jenkins, Maven, Ant, Log4j
Business Intelligence Tools: Tableau, Power BI
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS
Development Methodologies: Agile, Scrum, Waterfall, V-model, Spiral, UML