
Senior Data Engineer - ETL, PySpark, AWS/GCP

SOWJANYA GUNAGANTI

Dallas, TX, USA +1-660-***-**** *********@*****.***

PROFESSIONAL SUMMARY

Senior Data Engineer with 5+ years of experience delivering enterprise-scale data and content migrations using Python and AWS Glue. Skilled in building scalable PySpark pipelines on AWS EMR, designing multi-zone data lake architectures, and automating workflows with Airflow and Step Functions. Proven ability to optimize large-scale ETL processes and support analytics through governed data platforms.

PROFESSIONAL EXPERIENCE

Toyota Motor North America GCP Data Engineer Jul 2024 - Present

• Architected enterprise-scale data platforms across GCP supporting analytics, forecasting, and ML workloads.

• Designed multi-zone lakehouse structures (Raw, Curated, Semantic) enabling governed enterprise data modeling.

• Built scalable PySpark pipelines on Dataproc processing structured and semi-structured datasets.

• Developed metadata-driven Airflow/Composer DAGs supporting reusable operators and lineage propagation (see the first sketch below).

• Implemented validation frameworks using BigQuery scripting, Python rule engines, and serverless Cloud Functions.

• Optimized BigQuery SQL models using clustering, partition pruning, and materialized aggregates (see the second sketch below).

• Integrated curated semantic layers with Power BI and Looker, enabling enterprise-wide KPI reporting.

• Delivered source-to-target (S2T) mappings, profiling documentation, and semantic dictionaries for governance.

• Built event-driven ingestion using Pub/Sub and S3 event notifications, improving data-freshness SLAs.

• Collaborated with business and ML teams ensuring datasets aligned with feature engineering needs.
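
A minimal sketch of the metadata-driven DAG pattern noted above, assuming Airflow 2.4+ on Cloud Composer. The table names and the load callable are hypothetical placeholders for the real BigQuery/Dataproc work.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-table metadata; in practice this could come from a
# YAML file or a control table rather than an inline dict.
TABLES = {
    "sales_orders": {"source": "raw.sales_orders", "target": "curated.sales_orders"},
    "inventory":    {"source": "raw.inventory",    "target": "curated.inventory"},
}

def load_table(source: str, target: str) -> None:
    # Placeholder for the actual load logic (e.g., a BigQuery
    # INSERT ... SELECT or a Dataproc job submission).
    print(f"Loading {source} -> {target}")

with DAG(
    dag_id="metadata_driven_loads",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task per config entry: new datasets onboard by adding
    # metadata, not by writing new DAG code.
    for name, cfg in TABLES.items():
        PythonOperator(
            task_id=f"load_{name}",
            python_callable=load_table,
            op_kwargs={"source": cfg["source"], "target": cfg["target"]},
        )
```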
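
The clustering and partition-pruning setup behind the optimized SQL models could look roughly as follows; a sketch using the google-cloud-bigquery client, with hypothetical project, dataset, and schema names.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.semantic.daily_kpis",  # hypothetical fully qualified ID
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("dealer_id", "STRING"),
        bigquery.SchemaField("kpi_value", "FLOAT64"),
    ],
)
# Partition by date so filters on event_date prune whole partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
# Cluster by dealer_id so point lookups scan fewer blocks.
table.clustering_fields = ["dealer_id"]

client.create_table(table, exists_ok=True)
```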

Elevance Health AWS Data Engineer Jan 2023 - Jun 2024

• Developed large-scale Python-based PySpark ETL pipelines on AWS Glue and EMR processing 1–3 TB/day, using partition pruning, pushdown predicates, and broadcast joins to substantially reduce runtime.

• Designed a multi-zone S3 data lake (Raw → Standardized → Curated) using Parquet/ORC, schema evolution handling, and metadata tagging for governed analytics and content migration workflows.

• Implemented incremental ingestion using Glue job bookmarks, CDC logic, watermarks, and DynamoDB state tables, eliminating full-reload requirements (see the first sketch below).

• Built real-time streaming pipelines using Kinesis Data Streams, Kinesis Firehose, and Lambda, achieving low-latency ingestion for operational datasets.

• Automated complex ETL workflows by orchestrating Spark jobs with Step Functions and Lambda, implementing retries, DLQs, conditional branching, and fully parallelized execution.

• Optimized Athena and Redshift external schemas through Parquet conversion, compression, partition projection, and file compaction, reducing query cost.

• Engineered data quality checks using AWS Deequ and custom PySpark rule engines, validating schema drift, primary-key integrity, null violations, and distribution anomalies (see the second sketch below).

• Developed advanced observability with CloudWatch dashboards, Glue job profiling, EMR logs, metrics-based alarms, and SNS notifications for SLA enforcement.

• Built CI/CD pipelines using CodePipeline, CodeBuild, and Terraform automating Glue job packaging, IAM provisioning, infra rollouts, and versioning.
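
A sketch of the watermark-based incremental ingestion pattern referenced above: the last processed timestamp is kept in a DynamoDB state table and applied as a pushdown filter so only new rows are read. The bucket paths, table name, key schema, and watermark column are hypothetical.

```python
import boto3
from pyspark.sql import SparkSession, functions as F

STATE_TABLE = "etl_watermarks"  # hypothetical DynamoDB table keyed on "dataset"
DATASET_KEY = "claims"          # hypothetical dataset id

spark = SparkSession.builder.appName("incremental-claims").getOrCreate()
ddb = boto3.resource("dynamodb").Table(STATE_TABLE)

# 1. Read the previous watermark (default to epoch on the first run).
item = ddb.get_item(Key={"dataset": DATASET_KEY}).get("Item") or {}
watermark = item.get("last_ts", "1970-01-01 00:00:00")

# 2. Read only rows newer than the watermark; the filter is pushed
#    down to the Parquet reader, avoiding a full reload.
df = (
    spark.read.parquet("s3://my-lake/raw/claims/")  # hypothetical path
    .filter(F.col("updated_at") > F.lit(watermark))
)

# 3. Write the increment to the standardized zone.
df.write.mode("append").parquet("s3://my-lake/standardized/claims/")

# 4. Advance the watermark only after a successful write.
new_ts = df.agg(F.max("updated_at")).first()[0]
if new_ts is not None:
    ddb.put_item(Item={"dataset": DATASET_KEY, "last_ts": str(new_ts)})
```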
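
A minimal sketch of a custom PySpark rule engine in the spirit of the Deequ checks above: declarative rules evaluated against a DataFrame, with failures collected for alerting. The column names and input path are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

def check_not_null(df: DataFrame, column: str) -> bool:
    # Passes when no row has a null in the given column.
    return df.filter(F.col(column).isNull()).limit(1).count() == 0

def check_unique(df: DataFrame, column: str) -> bool:
    # Passes when the column has no duplicate values (primary-key check).
    return df.select(column).count() == df.select(column).distinct().count()

RULES = [
    ("claim_id is never null",    lambda df: check_not_null(df, "claim_id")),
    ("claim_id is a primary key", lambda df: check_unique(df, "claim_id")),
    ("amount is never null",      lambda df: check_not_null(df, "amount")),
]

def run_rules(df: DataFrame) -> list:
    """Return the descriptions of all failed rules."""
    return [desc for desc, rule in RULES if not rule(df)]

if __name__ == "__main__":
    spark = SparkSession.builder.appName("dq-rules").getOrCreate()
    claims = spark.read.parquet("s3://my-lake/standardized/claims/")  # hypothetical
    failures = run_rules(claims)
    if failures:
        # In a real pipeline this would feed an SNS notification.
        raise ValueError(f"Data quality failures: {failures}")
```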

WalkingTree Technologies Big Data Developer Jan 2020 - Aug 2021

• Developed Spark-Scala, PySpark, and Hive ETL pipelines for large-scale batch analytics.

• Automated ingestion from Teradata, Oracle, and Snowflake using Sqoop, Kafka, and shell workflows.

• Designed optimized Hive schemas using partitioning, bucketing, and ORC/Parquet formats (see the sketch below).

• Built Spark enrichment modules applying rule-driven transformations and quality enforcement.

• Developed reusable ingestion templates accelerating onboarding of new datasets.

• Integrated warehouses with Hadoop ecosystems enabling unified transformation layers.

• Delivered pre-aggregated BI extracts for Tableau and Power BI reporting.

• Improved Spark performance through memory tuning, shuffle optimization, and spill reduction.

• Supported cluster operations including YARN tuning and Hive metadata maintenance.

• Prepared engineering documentation including lineage diagrams and design specifications.
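
A sketch of the optimized Hive layout noted above: an ORC table partitioned by load date and bucketed on the join key, written through the Spark-Hive integration. The database, table, column, and path names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-layout")
    .enableHiveSupport()  # required to write managed Hive tables
    .getOrCreate()
)

orders = spark.read.parquet("/data/staging/orders/")  # hypothetical source

(
    orders.write.format("orc")
    .partitionBy("load_date")         # prunes partitions on date filters
    .bucketBy(32, "customer_id")      # co-locates rows for bucketed joins
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("analytics.orders")  # bucketBy requires saveAsTable
)
```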

EDUCATION

Northwest Missouri State University, Master's in Applied Computer Science, USA, 2022

Jawaharlal Nehru Technological University, Bachelor's in Computer Science, India, 2019

TECHNICAL SUMMARY

• Languages: Python, SQL, PySpark, Shell scripting

• Cloud Platforms: GCP (BigQuery, Dataproc, Composer), AWS (S3, Glue, Redshift, Lambda, EMR)

• Data Engineering: Data Lakes, ETL/ELT Pipelines, Batch & Streaming, Data Quality, Data Modeling, Governance, Content Migration

• Big Data: Hadoop, Hive, Spark, Kafka, Sqoop

• Orchestration: Airflow, Composer, Step Functions, ADF

• Databases: BigQuery, Snowflake, Redshift, SQL Server, Oracle, PostgreSQL

• Tools: Git, Terraform, Docker, Tableau, Power BI, Jira, Confluence

CERTIFICATIONS

• Google Cloud Professional Data Engineer: 2024

• SQL for Data Engineers: Coursera

• AWS Certified Data Analytics – Specialty: 2023

• Apache Airflow Fundamentals: Udemy

ACHIEVEMENTS

• Delivered multiple cloud migration and automation solutions across GCP, AWS, and Hadoop ecosystems.

• Recognized for designing reusable ETL/ELT patterns, significantly reducing onboarding effort.

• Enabled enterprise analytics through governed semantic datasets and KPI layers.


