AWS Data Engineer - ELT, PySpark, Airflow

Location:
Denton, TX
Posted:
June 18, 2026

Resume:

Pankaj Varma Dendukuri

707-***-**** ***************@*****.*** LinkedIn

Professional Experience

Comerica Bank June 2024 – Present

Data Engineer Dallas, TX

Migrated 20+ on-prem banking pipelines into a scalable ELT layer on AWS, processing 15TB+ of financial data monthly and eliminating ingestion delays caused by manual batch jobs.

Tuned PySpark transformation jobs on large-scale transaction datasets using dynamic partition pruning and broadcast joins, cutting pipeline runtime by 52% on Amazon EMR.

Managed 25+ dependency-driven workflows in Apache Airflow (MWAA) with retry logic, SLA tracking, and failure alerting, maintaining pipeline observability and 99.8% on-time completion for loan and risk reporting.

Built and maintained feature pipelines for credit risk models using Python and SageMaker, enabling weekly model retraining and establishing reproducible MLOps workflows for consistent feature delivery.

Designed Azure Data Factory pipelines to load compliance datasets into a governed schema, enabling SOX audit workflows and cross-cloud exchange between Azure ingestion and AWS analytics layers.

Developed Java-based Kafka consumer services to stream real-time transaction events into S3, reducing data availability lag from 6+ hours to under 20 minutes for downstream EMR pipelines.

Zelis May 2023 – Aug 2023

Data Engineer Intern Plano, TX

Developed AWS Glue ETL jobs to process 2M+ healthcare claims weekly into Redshift, improving load performance by 38% through job tuning and parallel execution.

Prepared 12+ months of claims data in Python for fraud detection models, handling deduplication, null treatment, and schema normalization for consistent training datasets.

Built dbt models to implement dimensional data modeling for healthcare claims, adding schema tests for duplicates and null values before loading into Snowflake reporting tables.

Providence Health Aug 2021 – July 2022

Data Engineer Hyderabad, India

Built Azure Data Factory pipelines to move EHR data from 4+ on-prem clinical systems into Azure Data Lake Storage Gen2, standardizing source formats for downstream transformation.

Developed PySpark transformation jobs in Azure Databricks using Delta Lake within a medallion pipeline to clean, deduplicate, and normalize patient records across source systems, reducing downstream pipeline errors by 35%.

Enforced HIPAA compliance through column-level SQL validation on PII fields, ensuring sensitive patient data met handling standards before reaching downstream analytical systems.

Automated daily ingestion of third-party claims from S3 using Python workflows, reducing reconciliation time by 40%.

Optimized complex analytical SQL involving window functions and multi-stage joins in Spark, reducing patient utilization reporting query latency by 30% for executive Power BI dashboards.

Academic Project

AI Data Pipeline for Real-Time Anomaly Detection

Designed a streaming anomaly detection pipeline using Kafka, Spark Structured Streaming, and Python, processing ~8K events/sec with consistent low-latency handling of financial transaction data.

Built feature extraction and model inference pipelines using MLflow, enabling batch and streaming scoring workflows and improving anomaly detection precision by 28% on validation datasets.

Technical Skills

Programming: Python (PySpark, Pandas), Java, SQL, Scala

Data Processing: Apache Spark, Spark SQL, Kafka, Spark Structured Streaming, Delta Lake

Cloud Platforms: AWS (S3, Glue, EMR, Redshift, Lambda), Azure (Data Factory, ADLS Gen2, Databricks)

Pipelines & Orchestration: Apache Airflow, dbt (Data Build Tool), ETL/ELT

Databases & Warehousing: Amazon Redshift, PostgreSQL, Azure SQL, Snowflake

AI/ML: Feature Engineering, Model Training & Inference Pipelines, MLflow, MLOps, Amazon SageMaker

DevOps: Git, Docker, CI/CD (Jenkins/GitHub Actions)

Visualization & Monitoring: Power BI, Tableau

Education

University of North Texas Aug 2022 – May 2024

Master of Science in Information Science Denton, TX


