Saipradeep Bomma
Dallas, Texas 940-***-**** *****************@*****.*** LinkedIn
PROFESSIONAL SUMMARY
• Senior Data Engineer with 5+ years of experience designing CDC-based pipelines and ETL workflows.
• Expert in Java, Python (PySpark), and Apache Spark for batch and streaming transformations.
• Hands-on with Debezium and CDC frameworks for hydrating data lakes from diverse databases.
• Proficient in Airflow, EMR, Glue Data Catalog, and AWS Step Functions for orchestration.
• Strong background in Spark SQL, DataFrames, and performance tuning for big data workloads.
• Focused on delivering queryable, analytics-ready datasets from raw CDC streams.
TECHNICAL SKILLS
• Programming & Processing: Java, Python, PySpark, Scala (plus), Spark DataFrames, Spark SQL, Spark Streaming
• CDC & ETL: Debezium, Kafka Connect, AWS DMS, Apache Hudi (plus), Apache Griffin (plus)
• Orchestration & Workflow: Apache Airflow, AWS MWAA, AWS Step Functions, AWS Batch
• Cloud & AWS: S3 (CRUD operations), EMR, EMR Serverless, Glue Data Catalog, Lambda (Python)
• Big Data & Streaming: Spark Streaming, Kafka, Hudi (plus), Deequ (plus)
• Data Quality & Governance: Schema validation, monitoring, automated testing, lineage tracking
WORK EXPERIENCE
FedEx Office Jul 2024 - Present
Azure Data Engineer Plano, Texas, USA
FedEx Office, a subsidiary of FedEx Corporation, is a dynamic and technology-driven leader in logistics & business services. I am designing, building, and maintaining scalable data pipelines to ingest, transform & store large volumes of data from various sources such as transactional systems, third-party APIs & external data feeds.
• Developed CDC ingestion pipelines using Debezium and Kafka to hydrate data lakes, capturing changes from multiple relational databases into Delta Lake and Synapse for analytics.
• Optimized PySpark and Spark SQL jobs for batch and streaming transformations, reducing runtimes from 120 minutes to under 75 while ensuring reliability at enterprise scale.
• Implemented Airflow DAGs with CI/CD validation to orchestrate raw CDC data flows, automating ingestion and transformation into curated, queryable analytics datasets.
• Built Delta Lake architectures with ACID compliance and schema evolution, supporting data warehouse modeling and near real-time analytics for enterprise reporting.
• Configured AWS Glue Data Catalog with EMR pipelines, enabling discoverability and governance of CDC-ingested data across S3 buckets.
• Collaborated with cross-functional analytics teams to standardize data contracts, improving consistency and usability of datasets across BI and ML workflows.
Environment: Debezium, Kafka, Spark (PySpark/SQL), Delta Lake, Airflow, AWS S3, EMR, Glue, Step Functions
Texas Health Resources Sep 2023 - Jun 2024
AWS Data Engineer Arlington, Texas, USA
Texas Health Resources (THR) is a nonprofit healthcare organization. I collected and integrated data from various sources such as electronic health records (EHR), financial systems, and operational databases, and developed and maintained the ETL (Extract, Transform, Load) processes that keep the data accurate, consistent, and reliable.
• Implemented CDC-based ingestion of EHR and financial systems into AWS S3 using Debezium, enabling near real-time data hydration for healthcare analytics and reporting.
• Developed PySpark ETL pipelines in EMR and Glue to transform CDC data into governed Snowflake and Redshift marts, ensuring compliance with HIPAA data standards.
• Automated Airflow DAGs on MWAA for batch and streaming workflows, reducing manual interventions from 40 to 15 monthly across production environments.
• Built Lambdas in Python to handle schema changes and CDC data reconciliation, improving resilience of ingestion pipelines across critical healthcare datasets.
• Delivered AWS Step Functions workflows for orchestrating multi-stage ETL pipelines, reducing operational complexity and improving auditability of healthcare data pipelines.
• Partnered with BI teams to deliver curated marts and dashboards, accelerating time-to-insight for operational and compliance reporting.
Environment: Debezium, Kafka, PySpark, AWS EMR, Glue, S3, Redshift, Lambda, Step Functions, MWAA, Airflow
Spearhead Insurance Broking Pvt Ltd Jan 2022 - Jul 2023
GCP Data Engineer Hyderabad, India
Spearhead Insurance Broking Pvt Ltd is a private company in the insurance sector. Built ETL pipelines using Google Cloud Dataflow/Apache Beam and implemented security best practices with IAM, encryption, and monitoring tools.
• Built streaming ingestion pipelines from claims and policy systems using Kafka + Debezium, hydrating BigQuery data marts for actuarial modeling and fraud detection analytics.
• Optimized Dataproc Spark jobs with partitioning, caching, and shuffle tuning, reducing runtimes from 8 hours to 5 for large-scale actuarial reporting workloads.
• Created Airflow DAGs in Cloud Composer for orchestrating CDC data pipelines, improving reliability and reducing failed workflows from 20 monthly to fewer than 10.
• Designed BigQuery marts with dbt and role-based access controls, expanding governed access from 4 user groups to 10 while ensuring regulatory compliance.
• Enhanced pipeline monitoring with Stackdriver and Datadog, reducing incident response times from 60 minutes to under 40 across production-critical CDC pipelines.
• Collaborated with business analysts to deliver curated, analytics-ready datasets from raw CDC sources, enabling faster insights and more consistent data definitions.
Environment: Debezium, Kafka, Dataflow, Dataproc, PySpark, Airflow, BigQuery, dbt, Cloud Composer, Datadog
Electronics Mart India Limited Jun 2020 - Dec 2021
Data Engineer Hyderabad, India
EMIL is a leading Indian retail company for electronics and consumer durables. Leveraged big data tools like Hadoop, Spark, Kafka, and Flink for large-scale and real-time data processing. Aug 2023 - May 2025
• Engineered Kafka + Spark Streaming CDC pipelines for real-time retail transactions, hydrating Snowflake marts and enabling low-latency dashboards for demand forecasting.
• Migrated data from on-prem MySQL to S3 + Snowflake using CDC connectors, reducing reporting query times from 25 seconds to under 12 across BI reporting workloads.
• Built Databricks PySpark pipelines for batch and streaming ETL, transforming raw CDC ingestion into curated marts for sales and inventory analytics.
• Optimized Airflow DAG orchestration by integrating schema validation, reducing manual interventions from 50 monthly to fewer than 20.
• Implemented S3 lifecycle policies and Glue Data Catalogs for CDC-ingested data, improving governance and cost optimization in analytics environments.
• Delivered CDC-ready dashboards in Power BI, training stakeholders on leveraging curated marts for self-service reporting and decision-making.
Environment: Kafka, Debezium, Spark (PySpark), Databricks, Airflow, AWS S3, Snowflake, Glue
ACHIEVEMENTS
• Awarded recognition at FedEx Office for designing CI/CD-enabled pipelines that reduced data failures from 100 incidents per month to under 70.
• Achieved cost savings at Texas Health Resources by optimizing AWS Lambda and EMR workflows, lowering monthly cloud spend from $50K to $30K.
• Implemented governed Snowflake and BigQuery architectures at Spearhead Insurance and EMIL, expanding secure data access from 5 teams to 12 enterprise units.
EDUCATION DETAILS
University of North Texas, Texas, USA
Master's, Data Science GPA: 4/4
CERTIFICATIONS
• AWS Certified Data Engineer Associate
• Microsoft Certified Azure Data Engineer Associate
• Google Cloud Associate Cloud Engineer