Shiva Reddy Anugu
Mobile: +1-346-***-**** Email: *****************@*****.*** Address: USA (Willing to relocate)
Professional Summary
●CDC and Streaming Expertise: Expert in setting up and operating Change Data Capture (CDC) pipelines for multiple database types (e.g., MySQL, PostgreSQL) to efficiently hydrate the enterprise data lake in AWS S3.
●Apache Spark & ETL Transformations: 4+ years of hands-on experience designing and implementing complex ETL transformations with Apache Spark (DataFrames, Spark SQL) for both high-velocity streaming and batch workloads.
●AWS Cloud Infrastructure: Extensive AWS skill set spanning S3 (including advanced object-level CRUD operations), EMR and EMR Serverless, and the Glue Data Catalog, with experience coordinating work across services.
●Orchestration & Automation: Proficient in designing and deploying automated data workflows using Amazon Managed Workflows for Apache Airflow (MWAA) for complex scheduling, and event-driven architectures built on AWS Step Functions and Lambda (Python).
●Programming Depth (Java & Python): Strong command of Java for core application logic and Python (PySpark) for big data processing and automation scripts.
●Big Data Optimization: Applied advanced performance tuning techniques across EMR/Spark clusters and data formats (e.g., Parquet, Apache Hudi) to keep lake data query-ready for analytics, reducing compute costs by 15% (see the sketch below).
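A minimal PySpark sketch of the CDC-to-Hudi upsert pattern summarized above; the paths, table name, and key fields are hypothetical, and the exact options depend on the Hudi and EMR versions in use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-hudi-upsert").getOrCreate()

# Hypothetical source: raw CDC change records landed in S3 by the capture layer
changes = spark.read.parquet("s3://example-lake/raw/orders_cdc/")

# Core Hudi write options; the record key plus precombine field drive upsert semantics
hudi_options = {
    "hoodie.table.name": "orders",                             # hypothetical table
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")  # append mode performs upserts into an existing Hudi table
    .save("s3://example-lake/curated/orders/"))
```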
Work Experience
Senior AWS Data Engineer | Molina Healthcare – Long Beach, CA | May 2024 – Present
●Engineered and deployed the core Change Data Capture (CDC) pipeline from multi-source transactional systems to the AWS S3 data lake, reducing data latency from hours to minutes.
●Developed high-efficiency Apache Spark ETL jobs in PySpark on EMR Serverless, transforming raw CDC data into analytics-ready gold-layer tables.
●Designed and automated complex orchestration workflows using Amazon Managed Workflows for Apache Airflow (MWAA), incorporating robust dependency management and alerting for all streaming and batch processing.
●Managed the security and lifecycle of the data lake on AWS S3, enforcing secure S3 operations (CRUD) and implementing encryption and access controls via IAM policies.
●Integrated data quality checks built on the AWS Deequ framework into the Spark transformations to enforce schema and data integrity before downstream consumption (see the sketch after this list).
●Utilized Python-based AWS Lambda functions and AWS Step Functions to implement event-driven data ingestion patterns and coordinate complex, multi-step AWS Batch processing jobs.
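A minimal sketch of the Deequ-style quality gate referenced in the list above, using the PyDeequ bindings; the input path, column names, and checks are hypothetical.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Deequ runs on the JVM, so its jar must be on the Spark classpath;
# PyDeequ picks the matching coordinate from the SPARK_VERSION env var.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://example-lake/silver/claims/")  # hypothetical input

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(Check(spark, CheckLevel.Error, "claims integrity")
                    .isComplete("claim_id")        # no nulls in the key column
                    .isUnique("claim_id")          # no duplicate keys
                    .isNonNegative("paid_amount"))
          .run())

# Any non-passing constraint blocks the downstream load
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
if result_df.filter("constraint_status != 'Success'").count() > 0:
    raise ValueError("Data quality checks failed; blocking downstream consumption")
```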
Big Data Engineer | Infosys – Hyderabad, India | February 2022 – August 2023
●Built and maintained large-scale batch processing pipelines on AWS EMR clusters, processing and transforming approximately 10TB of structured and unstructured data monthly.
●Developed scalable application logic in Java for custom data parsing modules and integrated them into the Apache Spark ecosystem.
●Created and optimized Apache Spark DataFrame pipelines for complex ETL transformations, turning raw data into consumable layers for analytics.
●Registered and maintained schemas for all data assets in the Glue Data Catalog, significantly improving metadata management and governance.
●Applied performance tuning strategies (partitioning, caching, cluster sizing) to Spark workloads, reducing average job runtimes by 20% (see the sketch after this list).
●Contributed to an internal proof-of-concept utilizing Apache Hudi for efficient upsert capabilities within the S3 data lake.
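A minimal PySpark sketch of the partitioning-and-caching tuning pattern referenced above; the paths and columns are hypothetical, and the right partition key depends on downstream query patterns.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl-tuned").getOrCreate()

raw = spark.read.json("s3://example-lake/raw/events/")  # hypothetical input path

# Derive the partition column once and cache, since multiple actions reuse it
events = raw.withColumn("event_date", F.to_date("event_ts")).cache()

events.groupBy("event_date").count().show()  # first action materializes the cache

# Align the shuffle with the output layout so each task writes fewer, larger files
(events
 .repartition("event_date")
 .write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://example-lake/curated/events/"))
```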
Data Engineer | Accenture – Hyderabad, India | May 2020 – December 2021
●Gained foundational knowledge of Change Data Capture (CDC) concepts and assisted in setting up initial replication processes for relational databases.
●Developed initial ETL jobs and schedules using open-source Apache Airflow and Python to manage daily batch workflows (see the sketch after this list).
●Wrote core utility functions in Java to support data validation and preprocessing steps before loading data into the warehouse.
●Assisted with configuring S3 bucket policies and managing day-to-day object operations, with a focus on data security and access control.
●Utilized Apache Spark DataFrames (via PySpark) for basic data manipulation tasks, establishing proficiency in distributed computing paradigms.
●Contributed to the team's understanding of Big Data concepts related to fault tolerance and distributed file systems.
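A minimal sketch of a daily batch DAG of the kind referenced above, written against the open-source Airflow 2.x API; the DAG id, schedule, and callable are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_daily_extract(**context):
    """Hypothetical placeholder for the daily extract-and-load step."""
    ...

with DAG(
    dag_id="daily_batch_etl",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="extract_load", python_callable=run_daily_extract)
```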
Professional Skills
●Programming Languages: Java, Python (PySpark), Scala
●Big Data & Processing: Apache Spark (DataFrames, Spark SQL, Spark Streaming), ETL transformations, streaming and batch processing, performance tuning, Change Data Capture (CDC), Apache Hudi
●Tools & Platforms: AWS (S3, EMR, EMR Serverless, Glue Data Catalog, Step Functions, MWAA, Lambda (Python), AWS Batch), Apache Airflow, AWS Deequ
●Others: S3 operations, Data Lake, ETL Jobs, Data Orchestration
Education
Master of Science in Information Technology | Elmhurst University, Chicago, Illinois, USA
Certification
●Microsoft Certified: Azure Data Engineer Associate
●Databricks Certified Data Engineer Associate
●AWS Certified Data Analytics – Specialty
●Google Cloud Certified: Professional Data Engineer