Deekshita Srihari G
Data Engineer
Irving, Texas **************@*****.*** 469-***-****
PROFESSIONAL SUMMARY
• Data Engineer with 5+ years of experience building and optimizing large-scale data pipelines on AWS using Spark, Glue, and EMR Serverless for both batch and streaming workloads.
• Hands-on experience with Change Data Capture (CDC) frameworks such as Debezium for real-time ingestion from relational databases into S3-based data lakes, ensuring reliability and low latency.
• Skilled in writing PySpark and SparkSQL transformations for cleansing, deduplication, and schema enforcement, enabling accurate and analytics-ready data.
• Experienced in orchestrating ETL pipelines through Apache Airflow and AWS Step Functions with robust error handling, monitoring, and dependency management.
• Proficient in optimizing Spark performance through partitioning, caching, shuffle tuning, and cluster-level configuration for cost-efficient processing.
• Strong programming experience in Java and Python, integrating APIs and implementing Lambda-based micro-ETL components for lightweight automation.
• Adept at leveraging Glue Data Catalog, Redshift Spectrum, and S3 best practices for data organization, lineage, and governance.
• Recognized for improving data freshness and reliability by building scalable, reusable CDC pipelines that bridge transactional systems with analytics platforms.
EXPERIENCE
Fidelity Investments Fort Worth, TX
Cloud Data Engineer Nov 2024 – Present
• Implemented end-to-end CDC pipelines using Debezium connectors for MySQL and PostgreSQL, continuously capturing insert, update, and delete events into S3 landing zones.
• Built PySpark ETL jobs on EMR to process CDC streams into partitioned Delta datasets, applying schema evolution, deduplication, and late-arrival handling.
• Developed Spark Streaming applications using Structured Streaming for near real-time enrichment of CDC data and delivery to curated S3 buckets.
• Designed Airflow DAGs and AWS Step Functions for orchestrating CDC hydration, validation, and transformation jobs with detailed failure recovery and logging.
• Tuned Spark and Glue ETL performance through optimized shuffle partitions, dynamic allocation, and broadcast joins to reduce compute cost by 30%.
• Created Python Lambda functions to trigger CDC table sync runs, automate metadata registration in Glue Catalog, and initiate downstream ETL batches.
• Implemented S3 versioning, lifecycle rules, and a prefix-based partitioning strategy to support efficient CRUD operations on hydrated data layers.
• Collaborated with data analytics teams to publish CDC-processed data to Redshift Spectrum, enabling low-latency ad hoc querying.
• Integrated the AWS Deequ validation framework to run automated data quality checks on refreshed CDC datasets.
Amerant Bank Houston, TX
Data Engineer Intern May 2024 – Oct 2024
• Assisted in developing ETL pipelines with Glue and PySpark to load transactional data into Redshift, improving data availability for credit risk analysis.
• Built Python scripts for data validation and cleansing, reducing ingestion errors by 25% and improving overall data quality.
• Tuned Redshift queries using sort and distribution keys, which reduced execution times by 35% and provided analysts with faster access to financial reports.
• Developed Airflow DAGs for automated job scheduling, improving the consistency and reliability of pipeline runs.
• Assisted in building PySpark ETL jobs to transform CDC raw data into analytics-ready datasets with deduplication and timestamp alignment.
• Applied S3 best practices, including partitioning objects by event date and table name, to reduce query scan costs.
• Documented metadata and data lineage in the Glue Data Catalog, strengthening governance practices and supporting metadata-driven ETL.
• Participated in code reviews with senior engineers, following collaborative development and version control best practices.
Accenture Mumbai, India
Senior Data Engineer Feb 2022 – Jul 2023
• Led development of large-scale ETL pipelines using AWS Glue and PySpark, enabling ingestion from diverse systems into Redshift, which became the enterprise data warehouse for analytics teams.
• Integrated Glue Data Catalog with Redshift Spectrum for schema discovery and federated query access to CDC data.
• Built CDC frameworks for financial data sources using Debezium and Kafka Connect, streaming incremental changes into S3 for lake hydration.
• Migrated on-prem Hadoop Hive tables to Amazon S3 and Redshift, reducing infrastructure costs by 25%.
• Integrated Kinesis-based event streaming pipelines to process near real-time data, reducing latency for operational dashboards and enabling timely business insights.
• Improved pipeline observability by implementing Airflow DAGs with retry logic, alerting, and lineage tracking, increasing stability and reducing incident response time.
• Partnered with the DevOps team to deploy CDC and Spark jobs through Jenkins and CodePipeline, enabling continuous integration and safe rollbacks.
• Optimized EMR clusters by profiling SparkSQL jobs and tuning I/O to achieve 40% faster execution on large partitions.
HCLTech Chennai, India
Data Engineer Jul 2021 – Jan 2022
• Implemented HIPAA-compliant ingestion pipelines using Dataflow to process patient claim data, ensuring data security and adherence to regulatory standards.
• Developed FHIR JSON ingestion frameworks into BigQuery, enabling standardized healthcare data structures and interoperability across reporting applications.
• Built PySpark transformations in Dataproc to normalize claim records, reducing processing time by 30% through Spark optimization.
• Designed automated data validation checks in Composer pipelines, which reduced downstream data errors by 20% and improved reliability of healthcare analytics.
• Created curated datasets in BigQuery for healthcare analysts, reducing dependency on raw transactional tables and accelerating reporting turnaround.
• Partnered with compliance teams to review access controls and encryption in pipelines, achieving HIPAA compliance and strengthening healthcare data governance.
• Integrated Stackdriver monitoring with Composer workflows, providing proactive failure alerts and reducing downtime for reporting systems.
• Documented pipeline workflows, schema mappings, and data quality rules, which simplified project onboarding and ensured knowledge transfer within the team.
Hero MotoCorp Mumbai, India
Associate Data Engineer May 2019 – Jun 2021
• Assisted in migrating data from MySQL into BigQuery using Dataflow pipelines, gaining early experience with cloud-native data warehousing and distributed data processing.
• Wrote SQL scripts for business reporting in BigQuery, using advanced query techniques such as window functions to improve analytics for internal operations teams.
• Developed a basic Spark ETL framework on Dataproc to merge new records with existing Parquet datasets, enhancing data lake freshness.
• Helped design star schemas in BigQuery datasets, improving query performance and building hands-on data modeling experience.
• Designed small CDC-style incremental load scripts to sync operational MySQL data to GCS, reducing load windows and avoiding redundant processing.
• Created Composer DAGs to schedule recurring ingestion jobs, improving the consistency of pipeline runs.
• Automated loading of CSV and JSON files into GCS and BigQuery, reducing manual workload by 20%.
• Assisted senior engineers in experimenting with Debezium change streams to understand offset management and schema evolution.
SKILLS
• Cloud Platforms: AWS (S3, Redshift, Glue, EMR, Lambda, Kinesis, Athena), GCP (BigQuery, Dataflow, Composer, Dataproc)
• Big Data & Processing: Apache Spark, PySpark, Delta Lake, Hadoop, Hive
• Orchestration & ETL: Airflow, dbt, Informatica, SSIS, Step Functions
• Databases: SQL (advanced), PostgreSQL, MySQL, MS SQL Server, NoSQL basics
• Programming & APIs: Python, SQL scripting, REST API integration
• Streaming: Kafka, AWS Kinesis
• Data Modeling & Governance: Star schema, Snowflake schema, schema evolution, data quality, lineage
• BI & Visualization: Power BI, Tableau
• DevOps & Infra: Terraform, Git, CI/CD (Jenkins, GitHub Actions, CodePipeline), Docker
CERTIFICATIONS
• AWS Certified Cloud Practitioner
• AWS Certified Data Engineer – Associate
EDUCATION
University of Central Missouri Warrensburg, MO
Master of Science in Computer Science