Shiva Reddy Anugu
Mobile: +1-346-***-**** Email: *****************@*****.*** Address: USA (Willing to relocate)
Professional Summary
●CDC and Streaming Expertise: Expert in setting up and operating Change Data Capture (CDC) pipelines for multiple database types (e.g., MySQL, PostgreSQL) to efficiently hydrate the enterprise data lake in AWS S3.
●Apache Spark & ETL Transformations: 4+ years of hands-on experience designing and implementing complex ETL transformations with Apache Spark (DataFrames, Spark SQL) for both high-velocity streaming and batch workloads.
●AWS Cloud Infrastructure: Extensive AWS skill set spanning S3 (including advanced object-level CRUD operations), EMR and EMR Serverless, and the Glue Data Catalog, with experience coordinating work across services.
●Orchestration & Automation: Proficient in designing and deploying automated data workflows using Amazon Managed Workflows for Apache Airflow (MWAA) for complex scheduling, and event-driven architectures built on AWS Step Functions and Lambda (Python).
●Programming Depth (Java & Python): Strong command of Java for core application logic and Python (PySpark) for big data processing and automation scripts.
●Big Data Optimization: Applied advanced performance tuning techniques across EMR/Spark clusters and data formats (e.g., Parquet, Apache Hudi) to keep lake data query-ready for analytics, reducing compute costs by 15% (see the sketch below).
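A minimal PySpark sketch of the CDC-to-Hudi upsert pattern summarized above; the paths, table name, and key fields are hypothetical, and the exact options depend on the Hudi and EMR versions in use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-hudi-upsert").getOrCreate()

# Hypothetical source: raw CDC change records landed in S3 by the capture layer
changes = spark.read.parquet("s3://example-lake/raw/orders_cdc/")

# Core Hudi write options; the record key plus precombine field drive upsert semantics
hudi_options = {
    "hoodie.table.name": "orders",                             # hypothetical table
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")  # append mode performs upserts into an existing Hudi table
    .save("s3://example-lake/curated/orders/"))
```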
Work Experience
Senior AWS Data Engineer | Molina Healthcare – Long Beach, CA | May 2024 – Present
●Engineered and deployed the core Change Data Capture (CDC) pipeline from multi-source transactional systems to the AWS S3 data lake, reducing data latency from hours to minutes.
●Developed high-efficiency Apache Spark ETL jobs in PySpark on EMR Serverless, transforming raw CDC data into analytics-ready gold-layer tables.
●Designed and automated complex orchestration workflows using Amazon Managed Workflows for Apache Airflow (MWAA), incorporating robust dependency management and alerting for all streaming and batch processing.
●Managed the security and lifecycle of the data lake on AWS S3, enforcing secure S3 operations (CRUD) and implementing encryption and access controls via IAM policies.
●Integrated data quality checks built on the AWS Deequ framework into the Spark transformations to enforce schema and data integrity before downstream consumption (see the sketch after this list).
●Utilized Python-based AWS Lambda functions and AWS Step Functions to implement event-driven data ingestion patterns and coordinate complex, multi-step AWS Batch processing jobs.
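A minimal sketch of the Deequ-style quality gate referenced in the list above, using the PyDeequ bindings; the input path, column names, and checks are hypothetical.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Deequ runs on the JVM, so its jar must be on the Spark classpath;
# PyDeequ picks the matching coordinate from the SPARK_VERSION env var.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://example-lake/silver/claims/")  # hypothetical input

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(Check(spark, CheckLevel.Error, "claims integrity")
                    .isComplete("claim_id")        # no nulls in the key column
                    .isUnique("claim_id")          # no duplicate keys
                    .isNonNegative("paid_amount"))
          .run())

# Any non-passing constraint blocks the downstream load
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
if result_df.filter("constraint_status != 'Success'").count() > 0:
    raise ValueError("Data quality checks failed; blocking downstream consumption")
```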
Big Data Engineer | Infosys – Hyderabad, India | February 2022 – August 2023
●Built and maintained large-scale batch processing pipelines on AWS EMR clusters, processing and transforming approximately 10TB of structured and unstructured data monthly.
●Developed scalable application logic in Java for custom data parsing modules and integrated them into the Apache Spark ecosystem.
●Created and optimized Apache Spark DataFrame pipelines for complex ETL transformations, turning raw data into consumable layers for analytics.
●Registered and maintained schemas for all data assets in the Glue Data Catalog, significantly improving metadata management and governance.
●Applied performance tuning strategies (partitioning, caching, cluster sizing) to Spark workloads, reducing average job runtimes by 20% (see the sketch after this list).
●Contributed to an internal proof-of-concept utilizing Apache Hudi for efficient upsert capabilities within the S3 data lake.
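A minimal PySpark sketch of the partitioning-and-caching tuning pattern referenced above; the paths and columns are hypothetical, and the right partition key depends on downstream query patterns.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl-tuned").getOrCreate()

raw = spark.read.json("s3://example-lake/raw/events/")  # hypothetical input path

# Derive the partition column once and cache, since multiple actions reuse it
events = raw.withColumn("event_date", F.to_date("event_ts")).cache()

events.groupBy("event_date").count().show()  # first action materializes the cache

# Align the shuffle with the output layout so each task writes fewer, larger files
(events
 .repartition("event_date")
 .write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://example-lake/curated/events/"))
```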
Data Engineer | Accenture – Hyderabad, India | May 2020 – December 2021
●Gained foundational knowledge of Change Data Capture (CDC) concepts and assisted in setting up initial replication processes for relational databases.
●Developed initial ETL jobs and schedules using open-source Apache Airflow and Python to manage daily batch workflows (see the sketch after this list).
●Wrote core utility functions in Java to support data validation and preprocessing steps before loading data into the warehouse.
●Assisted with configuring S3 bucket policies and managing day-to-day object operations, with a focus on data security and access control.
●Utilized Apache Spark DataFrames (via PySpark) for basic data manipulation tasks, establishing proficiency in distributed computing paradigms.
●Contributed to the team's understanding of Big Data concepts related to fault tolerance and distributed file systems.
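A minimal sketch of a daily batch DAG of the kind referenced above, written against the open-source Airflow 2.x API; the DAG id, schedule, and callable are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_daily_extract(**context):
    """Hypothetical placeholder for the daily extract-and-load step."""
    ...

with DAG(
    dag_id="daily_batch_etl",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="extract_load", python_callable=run_daily_extract)
```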
Professional Skills
●Programming Languages: Java, Python (PySpark), Scala
●Big Data & Processing: Apache Spark (DataFrames, Spark SQL, Spark Streaming), ETL transformations, streaming and batch processing, performance tuning, Change Data Capture (CDC), Apache Hudi
●Tools & Platforms: AWS (S3, EMR, EMR Serverless, Glue Data Catalog, Step Functions, MWAA, Lambda (Python), AWS Batch), Apache Airflow, AWS Deequ
●Others: S3 operations, Data Lake, ETL Jobs, Data Orchestration
Education
Master of Science in Information Technology | Elmhurst University, Chicago, Illinois, USA
Certification
●Microsoft Certified: Azure Data Engineer Associate
●Databricks Certified Data Engineer Associate
●AWS Certified Data Analytics – Specialty
●Google Cloud Certified: Professional Data Engineer