NAGA MAHITHA BATCHU
Dallas, TX +1-913-***-**** ***************@*****.*** LinkedIn
PROFESSIONAL SUMMARY
Data Engineer with 5+ years of experience designing and building scalable ETL pipelines and cloud-native data platforms across AWS and Azure using PySpark, SQL, Databricks, and Apache Spark. Strong expertise in batch and real-time data processing, data migration from legacy systems to modern lakehouse architectures, and developing high-performance SQL transformations for large-scale datasets. Proven ability to optimize distributed data workloads, improve query performance, and deliver reliable, analytics-ready datasets for finance and risk domains. Experienced in integrating multi-source data, implementing data quality frameworks, and supporting ML-driven analytics use cases.
SKILLS
• Core Programming & Computation: Python, SQL, PySpark, Spark SQL
• Data Engineering Expertise: ETL/ELT Pipelines, Data Modeling (Star/Snowflake), CDC, SCD, Incremental Processing, Data Integration, Idempotent Pipelines, Data Quality Validation
• Cloud & Data Platforms: Azure (Data Factory, ADLS Gen2, Azure Databricks, Synapse), AWS (S3, Glue, Lambda, EMR, Redshift, Athena, Step Functions, DMS), Snowflake, Delta Lake, Apache Spark, Databricks
• Orchestration & Pipelines: Apache Airflow, Azure Data Factory, dbt, REST API, Kafka
• DevOps & MLOps: Terraform, CI/CD, MLflow, Docker, GitHub Actions, Kubernetes, Model Serving
• Advanced Engineering: Qdrant (Vector DB), Neo4j (Graph DB), Elasticsearch
CERTIFICATIONS
• Microsoft Azure Data Engineer Associate (DP-203)
• Databricks Certified Data Engineer
EXPERIENCE
University of Central Missouri Data Engineer Aug 2024 - Nov 2025
• ETL & Data Processing : Engineered scalable PySpark ETL pipelines using Spark SQL and optimized partitioning strategies to process and transform large datasets for analytics and ML feature engineering.
• Incremental Data Processing : Designed incremental ingestion workflows using CDC and efficient data update strategies to reduce redundant processing and improve pipeline performance.
• Data Modeling & SQL : Developed complex SQL transformations and implemented relational and graph-based data models using SQL and Neo4j to support advanced analytical queries.
• Data Quality & Validation : Implemented automated data validation checks, schema enforcement, and consistency rules to ensure high data reliability across distributed pipelines.
• ML Data Systems & Automation : Built automated data pipelines integrating embedding-based search with Qdrant and enabled reproducible workflows using DVC and GitHub Actions for reliable experimentation.
Straviso Data Engineer Dec 2021 - Nov 2023
• ETL Pipeline Engineering & Data Ingestion : Built automated ingestion pipelines for 25+ TB-scale on-prem datasets into Amazon S3 using AWS Glue, Lambda, and Step Functions, enabling incremental querying through Athena and eliminating manual batch ingestion processes.
• Configuration-Driven Aggregation Framework : Designed a configuration-driven data aggregation framework where SQL transformations and dependencies were defined in JSON and executed through Step Functions + PySpark Glue + Lambda, enabling automated orchestration and reducing manual intervention in daily aggregation workflows.
• Data Security & Tokenization : Architected tokenization pipelines for 500M+ sensitive PII records using AWS Lambda and Athena, improving query performance while ensuring compliance with enterprise security and data governance standards.
• Database Migration : Migrated financial datasets from Oracle databases to Amazon S3 using AWS DMS, implementing schema validation, integrity checks, and automated workflows for reliable large-scale data transfer.
• SQL Optimization & Data Modeling : Optimized SQL transformations and implemented CDC/SCD data models in Snowflake and Redshift, improving query performance by 30–40%.
• Cross-Account Data Replication : Implemented cross-account AWS DMS configurations enabling secure multi-environment data migrations and improving data portability across cloud environments.
Nextenture Data Engineer Sep 2019 - Sep 2021
• Legacy System Migration : Led migration of complex datasets from DB2 and mainframe systems to AWS RDS PostgreSQL, improving query performance by 30% and enabling modern analytics capabilities.
• Real-Time Data Ingestion : Engineered near real-time ingestion pipelines using Snowpipe and event-driven architectures to ensure high-frequency data availability for financial and operational reporting.
• ETL Pipeline Orchestration : Designed and orchestrated scalable ETL/ELT workflows using Apache Airflow to automate ingestion of multi-source enterprise data into AWS S3 and Redshift.
• Data Warehouse Optimization : Optimized Redshift performance using distribution and sort key strategies, significantly improving query efficiency for high-traffic dashboards.
• Performance Tuning : Tuned distributed Spark transformations to reduce batch processing time by 35%.
• Data Quality & Validation : Implemented automated data validation and schema enforcement using Great Expectations, establishing a trusted data framework for downstream analytics.
EDUCATION
University of Central Missouri Master's, Data Science & AI Lee’s Summit, MO 2024 - 2025