
AWS Data Engineer - Spark, ETL, Redshift, Glue, Airflow

Location:
Pflugerville, TX
Salary:
120000
Posted:
December 11, 2025

Resume:

SANJAY BHARGAV KOMMA

Email: *************@*****.***

Ph: +1-469-***-****

PROFESSIONAL SUMMARY:

• 7+ years of experience as an AWS Data Engineer, specializing in designing, developing, and optimizing scalable data pipelines and cloud-native ETL solutions.

• Strong expertise in end-to-end Data Engineering lifecycle including requirement gathering, design, development, deployment, and support in Agile/Scrum environments.

• Hands-on experience with AWS services including Glue, Redshift, Athena, EMR, S3, Step Functions, Lambda, DynamoDB, RDS, and CloudWatch for large-scale data engineering workloads.

• Proficient in Python (Pandas, NumPy, PySpark) and Scala for data transformation, automation, and building reusable frameworks.

• Solid experience in Apache Spark, Spark SQL, and PySpark for distributed data processing, ETL, and real-time analytics.

• Skilled in data ingestion frameworks using AWS Glue, Kafka, Kinesis, and Fivetran to integrate structured, semi-structured, and unstructured data.

• Developed and deployed data lake architectures on AWS S3 integrated with Glue Catalog and Athena for cost-efficient analytics.

• Expertise in data warehousing with Amazon Redshift, implementing dimensional models, star schemas, snowflake schemas, and fact/dimension tables.

• Strong experience in SQL optimization, query performance tuning, and partitioning strategies for large datasets.

• Implemented CI/CD pipelines with Jenkins, Git, and AWS CodePipeline to automate ETL deployments and manage infrastructure as code using Terraform/CloudFormation.

• Designed and optimized data pipelines for real-time and batch processing, ensuring high availability, scalability, and fault tolerance.

• Proficient in workflow orchestration using Apache Airflow and AWS Step Functions for scheduling, monitoring, and automation of data workflows.

• Hands-on experience with data governance, metadata management, and data quality frameworks to ensure accuracy, reliability, and compliance (a minimal PySpark data-quality sketch follows this summary).

• Worked extensively with NoSQL databases like DynamoDB, MongoDB, and Cassandra for real-time, high-volume transaction processing.

• Collaborated with cross-functional teams including data scientists, analysts, and product owners to deliver data-driven insights and business intelligence.

• Experience in data migration projects, moving on-premise data warehouses and legacy systems to AWS cloud-native architectures.

• Strong knowledge of data modeling techniques (normalized, denormalized, dimensional) and performance tuning strategies for big data platforms.

• Exposure to big data ecosystems like Hadoop, Hive, HBase, Sqoop, and Zookeeper for distributed data storage and processing.

• Skilled in monitoring, troubleshooting, and optimizing pipelines using AWS CloudWatch, Glue job logs, and EMR monitoring tools.

• Passionate about leveraging cloud-native solutions and open-source frameworks to build robust, future-proof data platforms supporting advanced analytics and machine learning.
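
To make the partitioning and data-quality claims in this summary concrete, here is a minimal, hedged PySpark sketch of a quality gate followed by a partitioned Parquet write. It is illustrative only: the S3 paths, column names, and the 1% null tolerance are hypothetical placeholders, not details from any specific project.

```python
# Minimal sketch of a PySpark data-quality gate plus a partitioned Parquet write.
# Paths, column names, and thresholds are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quality-gate-example").getOrCreate()

df = spark.read.parquet("s3a://example-raw-zone/orders/")  # placeholder path

# Simple completeness check: the business key must be present and non-null.
total = df.count()
null_keys = df.filter(F.col("order_id").isNull()).count()
if total == 0 or null_keys / total > 0.01:  # 1% tolerance (assumed)
    raise ValueError(f"Data-quality check failed: {null_keys}/{total} null order_id values")

# Deduplicate and write partitioned Parquet for efficient downstream queries.
(df.dropDuplicates(["order_id"])
   .withColumn("order_date", F.to_date("order_ts"))
   .write.mode("overwrite")
   .partitionBy("order_date")
   .parquet("s3a://example-curated-zone/orders/"))
```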

TECHNICAL SKILLS:

Programming Languages: Python, Scala, Java, C, C++, SQL, Ruby, R.

Cloud Platforms: Amazon Web Services (AWS).

AWS: S3, DynamoDB, CloudWatch, CloudTrail, IAM, EMR, EC2, Kinesis, SNS, SQS, Aurora, Lambda, Athena, Redshift, Step Functions, Glue, QuickSight, Elasticsearch, CloudFront, CloudFormation, CodePipeline.

Data Warehouse: Snowflake, Redshift, Databricks, Data Lake, Oracle, SQL Server, Teradata.

Operating Systems: Windows, Linux, Unix, Mac.

Hadoop & Big Data Technologies: PySpark, Spark SQL, Hadoop, HDFS, MapReduce, YARN, Kafka, Sqoop, Hive, Pig, HBase, Druid, Apache NiFi.

ETL Tools: Informatica, Talend, Apache Airflow, Apache Oozie.

Databases: Hadoop, Oracle, MySQL, SQL Server, PostgreSQL, Snowflake, AWS Redshift, Cassandra.

IDEs: PyCharm, Visual Studio Code, SSMS, Data Studio, IntelliJ, Eclipse, PL/SQL Developer, Teradata, Oracle, Spyder, PyScripter, PyStudio, PyDev, IDLE, NetBeans, Jupyter Notebook.

Version Control & CI/CD: Git, GitLab, GitHub, Jenkins, Maven, Docker, Kubernetes, Hudson, Bamboo, Gradle, SVN, CVS, Terraform.

NoSQL Databases: MongoDB, Cassandra, HBase, Cosmos DB, DynamoDB, MariaDB.

Data Visualization: Tableau, Power BI.

PROFESSIONAL EXPERIENCE:

Client: KeyBank, Cleveland, OH May 2023 – Present

Role: Senior AWS Data Engineer

Responsibilities:

• Designed and implemented scalable ETL/ELT pipelines using AWS Glue, Lambda, and PySpark to process structured and unstructured data from diverse sources into centralized data lakes.

• Built and optimized data lakes on Amazon S3, ensuring efficient data partitioning, compression, and lifecycle policies for cost optimization and performance.

• Developed real-time data streaming architectures using Amazon Kinesis, Kafka, and AWS Lambda for event-driven analytics and fraud detection systems.

• Implemented data warehouse solutions using Amazon Redshift, integrating with Glue Catalog and Spectrum for seamless data querying and transformation.

• Automated infrastructure provisioning using Terraform and CloudFormation, reducing manual errors and ensuring consistent, version-controlled deployments.

• Optimized Spark jobs on AWS EMR, leveraging partitioning, bucketing, and caching strategies to improve runtime performance and scalability for large datasets.

• Created CI/CD pipelines using Jenkins and CodePipeline, enabling automated data pipeline testing, deployment, and monitoring across multiple environments.

• Integrated data from relational (MySQL, PostgreSQL, Oracle) and non-relational (MongoDB, DynamoDB) sources using AWS DMS and custom Python-based ingestion frameworks.

• Designed event-driven architectures using AWS SQS and SNS to support asynchronous communication and trigger-based workflows in complex data ecosystems.

• Built serverless data processing workflows leveraging AWS Lambda, Step Functions, and Glue Jobs for scalable, cost-efficient data transformation.

• Developed custom data quality and validation frameworks using Python, Pandas, and AWS Glue to ensure high data accuracy and consistency across environments.

• Implemented robust monitoring and alerting systems using Amazon CloudWatch, CloudTrail, and SNS for pipeline health, latency, and anomaly detection.

• Performed schema evolution and data cataloging using AWS Glue Data Catalog, ensuring proper governance and discoverability of datasets across teams.

• Collaborated with cross-functional teams (data scientists, BI, and DevOps) to build data models, optimize pipelines, and deliver real-time analytical dashboards.

• Migrated legacy on-premise ETL workflows to AWS Cloud, achieving a 40% reduction in data processing time and improved fault tolerance.

• Ensured data security and compliance by implementing IAM roles, encryption (KMS), and VPC-based access controls following AWS best practices.

• Developed and deployed PySpark-based transformation logic to handle complex business rules and analytics-ready data sets in production.

• Created and managed Redshift Spectrum external tables to query S3 data directly, optimizing reporting performance and minimizing data duplication.

• Implemented Airflow DAGs and schedulers for orchestration, ensuring reliable execution and monitoring of daily data ingestion and transformation workflows (a minimal DAG sketch follows this list).

• Contributed to performance tuning, cost optimization, and automation initiatives, leading to improved efficiency and reduced AWS operational expenses.
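
A minimal sketch of the Airflow orchestration pattern referenced above, assuming Airflow 2.x with the Amazon provider package installed; the DAG id, task names, Glue job name, and region are hypothetical placeholders rather than production values.

```python
# Minimal sketch of a daily Airflow DAG: validate a source, then run a Glue job.
# DAG id, task names, the Glue job name, and the region are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_ingestion_pipeline",       # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    def validate_source(**_):
        # Placeholder for source-availability and row-count checks.
        pass

    validate = PythonOperator(
        task_id="validate_source",
        python_callable=validate_source,
    )

    transform = GlueJobOperator(
        task_id="run_glue_transform",
        job_name="curated_daily_transform",  # hypothetical Glue job name
        region_name="us-east-1",
    )

    validate >> transform  # run validation before the Glue transformation
```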

Client: Pfizer, New York, NY August 2020 – April 2023

Role: Senior AWS Data Engineer

Responsibilities:

• Designed and developed end-to-end ETL and ELT pipelines on AWS using Glue, Lambda, Step Functions, and Redshift to ingest, transform, and load structured and semi-structured clinical and R&D datasets.

• Architected a HIPAA-compliant AWS Data Lake on S3 with Raw, Staging, and Curated zones, implementing encryption, version control, and lifecycle management policies to ensure scalability and cost optimization.

• Developed and optimized PySpark transformation jobs on EMR for high-volume genomic and patient data; leveraged partitioning, bucketing, and caching techniques to reduce data-processing runtime by 40%.

• Built data-validation and schema-enforcement frameworks in AWS Glue using Python and DynamicFrames to guarantee accuracy, consistency, and compliance with FDA 21 CFR Part 11 and GxP standards (a minimal Glue sketch follows this list).

• Integrated AWS Glue Data Catalog and Lake Formation for automated schema discovery, metadata tracking, access control, and end-to-end data lineage visibility.

• Developed Python-based ETL automation scripts and data-quality tests, embedding them in CI/CD pipelines managed with Jenkins and CodePipeline to achieve continuous integration and deployment.

• Provisioned and managed cloud infrastructure with Terraform and CloudFormation, ensuring consistency across development, QA, and production environments.

• Configured CloudWatch dashboards, SNS notifications, and EventBridge triggers for proactive monitoring, job orchestration, and anomaly detection across all pipeline components.

• Optimized storage and query performance by adopting Parquet and ORC file formats, improving compression ratios and reducing query costs by 35%.

• Migrated legacy Informatica and SQL Server ETL workflows to AWS Glue, creating event-driven, serverless pipelines that lowered operational costs by 30%.

• Implemented data-lineage tracking and metadata-driven architecture, ensuring traceability from raw ingestion to final analytical datasets for regulatory audits.

• Partnered with regulatory, QA, and compliance teams to prepare data-governance documentation, SOPs, and traceability matrices supporting audit readiness.

• Integrated AWS Lake Formation, Glue Catalog, and CloudTrail logs for automated access logging and audit-trail generation, meeting all FDA inspection requirements.

• Documented data models, process flows, architecture diagrams, and dependency maps in Confluence and Lucidchart for enterprise knowledge sharing.

• Enabled secure cross-account S3 data sharing through AWS RAM, improving collaboration between research and QA divisions.

• Delivered compliance-ready, automated data pipelines that improved data availability, audit traceability, and reporting turnaround time for global R&D operations.

• Overall, achieved 45% faster processing, 60% less manual reprocessing, and 100% regulatory compliance, accelerating Pfizer’s clinical-data analytics and decision-making lifecycle.
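
A minimal sketch of the Glue DynamicFrame schema-enforcement approach referenced above, assuming the script runs inside an AWS Glue job environment (where the awsglue modules are available); the database, table, column names, and S3 paths are hypothetical placeholders.

```python
# Minimal sketch of a Glue job enforcing an explicit schema with DynamicFrames.
# Database, table, column names, and S3 paths are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="clinical_raw", table_name="lab_results"
)

# Resolve ambiguous column types, then cast/rename to the expected schema.
typed = raw.resolveChoice(specs=[("result_value", "cast:double")])
curated = ApplyMapping.apply(
    frame=typed,
    mappings=[
        ("patient_id", "string", "patient_id", "string"),
        ("result_value", "double", "result_value", "double"),
        ("collected_date", "string", "collected_date", "date"),
    ],
)

# Write the curated output back to S3 as partitioned Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated-zone/lab_results/",
        "partitionKeys": ["collected_date"],
    },
    format="parquet",
)
job.commit()
```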

Client: NextEra Energy, Juno Beach, Florida May 2018 – July 2020

Role: Big Data Engineer

Responsibilities:

• Designed and developed robust ETL pipelines for ingesting, transforming, and loading large-scale structured and unstructured data.

• Built and optimized cloud-based data warehouses and data lakes on AWS for scalable analytics.

• Implemented real-time streaming pipelines using Apache Kafka, Kinesis, or Spark Streaming for low-latency data processing (a minimal sketch follows this list).

• Engineered batch and micro-batch processing frameworks using Apache Spark, PySpark, and Hadoop for high-volume datasets.

• Developed data models using star, snowflake, and normalized schemas to support reporting and analytics.

• Automated workflow orchestration and scheduling using Apache Airflow, Oozie, or AWS Step Functions.

• Implemented data validation and quality checks to ensure accuracy, completeness, and reliability of datasets.

• Optimized SQL queries and Spark jobs for performance, reducing runtime and infrastructure costs.

• Assisted in migrating legacy systems and on-premise data warehouses to cloud-based platforms.

• Built reusable Python, SQL, and Shell scripts for ETL, data transformation, and automation.

• Collaborated with cross-functional teams including data scientists, analysts, and business stakeholders to deliver actionable insights.

• Monitored and troubleshot pipelines using CloudWatch, Datadog, or custom logging frameworks for operational excellence.

• Ensured data governance, access controls, and security compliance for sensitive and regulated data.

• Supported CI/CD pipelines for ETL and cloud infrastructure deployments using Jenkins, Git, and Terraform.

• Developed dashboards and visualizations in Tableau, Power BI, or AWS QuickSight to drive business decisions.

• Applied performance tuning and partitioning strategies to optimize data storage and retrieval.

• Participated in requirements gathering, design discussions, and sprint planning in Agile/Scrum environments.

• Researched and applied emerging big data and cloud technologies to enhance platform scalability and efficiency.
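
A minimal sketch of the streaming pattern referenced above, using Spark Structured Streaming with a Kafka source and a Parquet sink on S3. The broker address, topic, message schema, and paths are hypothetical placeholders, and the Spark Kafka connector package is assumed to be available on the cluster.

```python
# Minimal sketch: Spark Structured Streaming job reading JSON events from Kafka
# and writing micro-batches to S3 as Parquet. All names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("meter-readings-stream").getOrCreate()

# Assumed message schema for the placeholder topic.
schema = StructType([
    StructField("meter_id", StringType()),
    StructField("reading_kwh", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "meter-readings")               # placeholder topic
       .load())

# Parse the Kafka value payload into typed columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("r"))
          .select("r.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/curated/meter_readings/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/meter_readings/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```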

EDUCATION:

Master’s in Computer Science from Illinois Institute of Technology, Chicago, IL.


