
Senior Data Engineer with Cloud & ETL Expertise

Location: Ashburn, VA
Posted: March 20, 2026


Bhavani

Data Engineer

*****************@*****.*** +1-571-***-****

PROFESSIONAL SUMMARY

Data Engineer with 8+ years of experience building scalable data pipelines, ETL frameworks, and distributed data processing systems using Python, PySpark, SQL, and cloud platforms (AWS/Azure/GCP). Experienced in data integration, schema design, and large-scale data transformation pipelines supporting enterprise analytics platforms. Skilled in Spark-based data processing, data modeling, and CI/CD workflows using Git, with strong expertise in data governance, data lineage, and cloud-based data architectures.

• Developed end-to-end batch and streaming data pipelines using PySpark and Delta Lake on Databricks, enabling high-performance data processing and reliable data workflows (a minimal sketch follows this list).

• Built and optimized cloud-based data integration pipelines using Azure Data Factory, Azure Databricks, and AWS Glue to support scalable data ingestion and transformation.

• Designed enterprise cloud data lake architectures and data models using Azure Data Lake Storage (ADLS), AWS S3, Azure Synapse Analytics, Amazon Redshift, and BigQuery to support analytics and reporting.

• Migrated critical datasets from on-premises systems (Oracle, MySQL, Teradata) to cloud data lakes (ADLS, AWS S3, GCS) using Sqoop, AWS Database Migration Service (DMS), Cloud Data Fusion, and Python automation.

• Worked extensively with Azure Data Lake, Azure SQL Database, Azure Synapse, Azure Blob Storage, AWS S3, and Amazon RDS for scalable data storage and analytical processing.

• Implemented Data Lake and Delta Lake architectures integrating large-scale datasets using Azure Functions, AWS Lambda, Cosmos DB, DynamoDB, MongoDB, and Elasticsearch.

• Performed advanced data processing and performance tuning on Oracle, Azure SQL Database, Amazon RDS, and PostgreSQL using indexing, query optimization, and stored procedures.

• Designed and maintained Hive tables and dimensional data models (Star Schema) with partitioning and performance optimization for large datasets.

• Utilized Python libraries including PySpark and NumPy to build automated frameworks for data ingestion, transformation, and workflow automation.

• Automated complex data pipelines using Apache Airflow (AWS MWAA) with Python DAGs supporting scheduling, monitoring, and alerting.

• Built enterprise-scale cloud data platforms using Terraform, AWS CloudFormation, Jenkins, and Git to enable Infrastructure-as-Code deployments.

• Optimized and debugged distributed Spark workloads across Databricks, AWS EMR, and Azure Databricks to improve performance and cost efficiency.

• Implemented data quality and cleansing processes using Informatica Data Quality (IDQ) to improve enterprise data consistency.

• Developed modular and reusable Python components for data ingestion, transformation, API integration, and pipeline orchestration.

• Provisioned cloud infrastructure using Terraform modules for networking, compute resources, and Databricks/EMR clusters.

• Integrated real-time streaming pipelines using Apache Kafka, AWS Kinesis, and Azure Event Hubs.

• Built big data platforms using Hadoop ecosystem tools including HDFS, Hive, HBase, Sqoop, and Apache NiFi.

• Modernized legacy ETL pipelines by migrating Informatica and SSIS workflows to Spark/Databricks and AWS Glue architectures.

• Migrated SQL Server SSIS workflows to Databricks and AWS Glue using Spark SQL and notebook-based pipelines.

• Integrated Azure Databricks and AWS analytics services with ADLS, S3, Synapse, Redshift, Azure SQL, and Cosmos DB for end-to-end data solutions.

• Authored technical documentation, architecture diagrams, and data flow specifications for engineering teams and stakeholders.

• Implemented CI/CD pipelines using Git, Jenkins, and AWS CodePipeline to support automated deployments.

• Orchestrated enterprise data pipelines using Apache Airflow and Apache Oozie.

• Worked extensively with relational databases (Oracle, MySQL, PostgreSQL, SQL Server, Amazon Aurora) and NoSQL databases (MongoDB, Cassandra, HBase, Cosmos DB, DynamoDB).

• Developed HiveQL and Pig scripts for large-scale data transformations and analytics.

• Utilized SQL, Python, PySpark, and Scala for scalable data engineering solutions.

• Collaborated in Agile/Scrum environments using JIRA and Confluence for sprint planning, documentation, and project tracking.
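
A minimal sketch of the kind of PySpark and Delta Lake batch pipeline described in the first bullet above; the session setup, storage paths, and column names are illustrative assumptions rather than production code:

from pyspark.sql import SparkSession, functions as F

# On Databricks the session is provided; created here so the sketch is self-contained
spark = SparkSession.builder.appName("orders-batch-pipeline").getOrCreate()

# Ingest raw JSON landed in cloud storage (bucket and path are hypothetical)
raw = spark.read.json("s3://example-bucket/raw/orders/")

# Basic cleansing: deduplicate, normalize timestamps, drop invalid amounts
curated = (
    raw.dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .filter(F.col("amount") > 0)
)

# Persist as a partitioned Delta table for reliable, queryable downstream workflows
(
    curated.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://example-bucket/curated/orders/")
)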

Technical Skills

Languages

Python, SQL, PySpark, Shell Scripting, HiveQL

Big Data & Processing

Apache Spark (PySpark, Spark SQL, DataFrames, RDD), Hadoop (HDFS, YARN), Hive, Kafka, Sqoop, HBase, Apache NiFi

Cloud Platforms

AWS (S3, Glue, EMR, Lambda, Redshift, RDS, DynamoDB, Kinesis, Lake Formation, CloudFormation, MWAA, IAM, CloudWatch)

Azure (Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage Gen2, Azure SQL Database, Azure Event Hubs)

GCP (BigQuery, Dataproc, Cloud Data Fusion)

Databases

Oracle, Amazon Aurora, MongoDB, MySQL, PostgreSQL, Teradata, SQL Server

Data Lake Technologies

Delta Lake, Azure Data Lake Storage Gen2, AWS S3, AWS Lake Formation

Workflow & Data Integration

Apache Airflow (MWAA), Azure Data Factory (ADF), Apache Kafka, Azure Event Hubs, REST APIs, Azure Logic Apps

Data Warehousing & Modeling

Amazon Redshift, Azure Synapse Analytics, Azure SQL Data Warehouse, Oracle, MySQL, SQL Server; Star Schema, Snowflake Schema

ETL & Data Engineering

PySpark, Apache Spark, Databricks, Azure Data Factory (ADF), AWS Glue, SSIS, Informatica Data Quality (IDQ), Informatica Cloud, Cloud Data Fusion

Data Governance & Quality

Data Cataloging, Data Mapping, Master Data Management (MDM), Data Quality Management, Schema Evolution

Development

Git, Jenkins, CI/CD Pipelines, Terraform, Agile/Scrum, JIRA, Confluence

PROFESSIONAL EXPERIENCE

Amtrak Washington, DC, USA Feb 2023 – Present

Senior Data Engineer

Designed scalable real-time data pipelines using Apache Spark, AWS Kinesis, and Apache Flume to ingest data from multiple sources and enable near real-time analytics.

Designed and maintained ETL pipelines using Azure Data Factory, Azure SQL Database, and Azure Blob Storage for scalable cloud-based data integration.

Developed and optimized cloud-native ETL pipelines using AWS S3, AWS Glue, EMR, and Lambda to process large volumes of structured and semi-structured data.

Built distributed data processing pipelines using PySpark on AWS EMR, improving scalability and performance of large-scale data workloads.

Designed and maintained cloud data warehouse solutions using Snowflake and Amazon Redshift to support enterprise reporting and advanced analytics.

Implemented serverless event-driven workflows using AWS Lambda and Step Functions to automate data transformations and reduce infrastructure management.

Implemented monitoring and logging frameworks using Elasticsearch, Kibana, and Amazon CloudWatch to improve pipeline reliability and troubleshooting.

Developed machine learning models using Python and PySpark, including LSTM-based time-series forecasting for predictive analytics.

Developed a neural network-based housing price prediction model achieving approximately 0.80 explained variance using feature engineering and exploratory data analysis.

Automated data ingestion pipelines using Fivetran by configuring incremental loads, schema mapping, and scheduled data synchronization.

Developed workflow orchestration pipelines using Apache Airflow (AWS MWAA) for scheduling, pipeline backfills, and workflow automation (see the DAG sketch after this section).

Implemented high-performance ETL transformations using Spark SQL and loaded curated datasets into Snowflake for analytics and business intelligence.

Designed dimensional data models (Star and Snowflake schemas) to support OLAP reporting and business intelligence.

Developed scalable data integration and ingestion pipelines using Python, PySpark, and SQL to process data from APIs, ERP systems, S3, and relational databases into cloud-based data platforms.

Designed and optimized distributed data transformation workflows using PySpark and Spark to process large enterprise datasets and support high-performance analytics.

Built and maintained data models using star and snowflake schema designs, ensuring efficient query performance and scalable data warehousing solutions.

Implemented CI/CD pipelines with Git for version control, while establishing data governance frameworks including data lineage, monitoring, and data quality validation across data platforms.

Developed internal data-driven applications using Python and Django to improve data accessibility for business users.

Collaborated in Agile development environments using CI/CD practices to deliver reliable and scalable data solutions.
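
A minimal sketch of an Airflow DAG of the kind used for the MWAA orchestration work above, assuming Airflow 2.x; the DAG id, schedule, and task callables are illustrative placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for real extract/transform/load logic
def extract(**context):
    print("pull data from sources")

def transform(**context):
    print("apply transformations")

def load(**context):
    print("load curated data to the warehouse")

with DAG(
    dag_id="example_etl",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",        # daily batch schedule
    catchup=False,                     # disable automatic backfills by default
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task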

HSBC Hyderabad, India Dec 2020 – Jan 2023

Data Engineer

Designed and implemented scalable data processing pipelines using PySpark and Spark SQL to support large-scale enterprise analytics.

Developed PySpark benchmarking programs to evaluate Spark performance against HiveQL workloads and optimize processing efficiency.

Built and optimized PySpark DataFrame transformations in Azure Databricks, ingesting and transforming data from Azure Data Lake Storage (ADLS) and Azure Blob Storage.

Developed Python and PySpark-based ETL pipelines for financial data processing and reporting systems.

Built SQL-based transformations and optimized database queries for high-volume financial datasets.

Designed data warehouse schemas and dimensional models to support financial analytics platforms.

Implemented data ingestion pipelines integrating transactional systems and enterprise databases.

Processed large-scale financial datasets using distributed Spark processing frameworks.

Worked in Agile development environments with Git version control and automated CI/CD pipelines.

Automated ETL workflows using Azure Data Factory (ADF) by integrating Databricks and ADLS to enable efficient pipeline orchestration.

Integrated Azure Data Lake Storage with Azure Databricks to enable scalable data processing and advanced analytics workloads.

Developed real-time data ingestion pipelines using Apache Spark and Kafka Streams to process large volumes of financial transaction data.

Built automated data validation and reconciliation scripts to proactively identify data discrepancies and reduce QA-reported defects.

Developed and managed Apache Airflow workflows in GCP using multiple operators to orchestrate enterprise data pipelines.

Migrated datasets from MySQL and Teradata to Google Cloud Storage (GCS) using Python automation and Cloud Data Fusion pipelines.

Integrated external APIs with GCP Cloud Functions to retrieve and cache real-time financial data for banking analytics dashboards.

Consolidated data from multiple heterogeneous sources into Snowflake, transforming complex nested JSON datasets for analytical reporting.

Collaborated with QA teams to analyze and resolve data discrepancies identified during UAT and regression testing.

Applied Spark performance optimization techniques including partitioning, broadcast joins, caching, and in-memory computation (see the tuning sketch after this section).

Configured GCP services such as Dataproc, Cloud Storage, and BigQuery using Cloud Shell SDK for scalable analytics infrastructure.

Managed Infrastructure as Code using Terraform across Azure and GCP environments for consistent and automated deployments.

Implemented centralized monitoring using Azure Log Analytics and automated infrastructure configuration using Terraform.

Developed and maintained Apache Airflow DAGs in Python with SLA monitoring, scheduling, and workflow automation.

Migrated legacy SSIS-based ETL workflows to Azure Databricks, improving scalability and maintainability of data pipelines.

Maintained and optimized SSIS packages using Visual Studio to enhance enterprise ETL reliability.

Developed shell scripts for automation, log monitoring, and alerting, scheduling batch workflows using crontab.

Designed dimensional data warehouses using star schema modeling, optimizing performance through indexing, partitioning, and constraints.
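
A brief sketch of the Spark tuning techniques noted above (broadcast join, caching, partition-aligned writes); the dataset paths and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table
transactions = spark.read.parquet("gs://example-bucket/transactions/")
branches = spark.read.parquet("gs://example-bucket/branches/")

# Broadcast the small dimension so the join avoids shuffling the large table
enriched = transactions.join(broadcast(branches), "branch_id")

# Cache the joined result because several downstream aggregations reuse it
enriched.cache()

daily_totals = (
    enriched.groupBy("branch_id", "txn_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Repartition by the partition column so output files align with the on-disk layout
(
    daily_totals.repartition("txn_date")
    .write.mode("overwrite")
    .partitionBy("txn_date")
    .parquet("gs://example-bucket/curated/daily_totals/")
)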

Carrier Global Hyderabad, India May 2018 – Sep 2020

Data Developer

Worked extensively in the Big Data ecosystem (Hadoop, HDFS, Hive, Spark, PySpark, Kafka, and AWS EMR) to build scalable batch and streaming data pipelines.

Processed and analyzed web log data stored in Amazon S3 using EC2, Kafka, and HDFS, improving reporting data availability and reducing processing time by 30%.

Developed Python automation scripts to convert XML data into JSON format and integrated DynamoDB for task tracking and error logging, improving data accuracy by 25% (see the sketch after this section).

Built ETL workflows for data ingestion and transformation between Amazon S3 and DynamoDB, improving operational efficiency by 40%.

Implemented real-time streaming pipelines using AWS Kinesis Firehose and S3 to automate product recommendation updates.

Leveraged AWS EMR for large-scale data transformations across S3, DynamoDB, and Snowflake to support scalable analytics workloads.

Automated data validation and error tracking using AWS serverless workflows and DynamoDB, improving data quality and pipeline reliability.

Delivered real-time analytics by integrating streaming data pipelines with machine learning predictions.

Optimized data pipeline performance by implementing partitioning strategies for historical and incremental datasets.
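
A small sketch of the XML-to-JSON conversion with DynamoDB task tracking described above; the table name, item schema, and sample record are assumptions, and running it requires configured AWS credentials and an existing table:

import json
import xml.etree.ElementTree as ET

import boto3

# Hypothetical table name; assumes the DynamoDB table already exists
dynamodb = boto3.resource("dynamodb")
task_table = dynamodb.Table("etl-task-tracking")

def xml_to_json(xml_string: str) -> str:
    # Flatten a simple one-level XML record into JSON (not schema-aware)
    root = ET.fromstring(xml_string)
    return json.dumps({child.tag: child.text for child in root})

def process_record(task_id: str, xml_string: str) -> None:
    try:
        payload = xml_to_json(xml_string)
        task_table.put_item(Item={"task_id": task_id, "status": "SUCCEEDED", "payload": payload})
    except ET.ParseError as err:
        # Record the failure so bad records can be reviewed and replayed
        task_table.put_item(Item={"task_id": task_id, "status": "FAILED", "error": str(err)})

# Example usage with an inline sample record
process_record("task-001", "<order><order_id>42</order_id><amount>19.99</amount></order>")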

EDUCATION

BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

Sangai International University, Manipur, India


