Sai Bodlapati
+1-557-***-**** | **************@*****.*** | Open to Relocation
PROFESSIONAL SUMMARY
Senior Data Engineer with 4+ years of experience designing and building scalable batch and real-time data platforms in the financial services and healthcare domains. Specialized in architecting distributed ETL/ELT pipelines using Apache Spark, Kafka, and Databricks to process millions of records daily. Experienced in building Lakehouse architectures using Delta Lake on AWS and Azure, enabling reliable analytics, reporting, and machine learning workloads. Proven track record of cutting Spark pipeline runtimes by 30%, implementing data reliability frameworks, and designing secure, high-performance cloud-native data platforms. Strong expertise in data modeling, streaming architecture, infrastructure automation, and data governance.
TECHNICAL SKILLS
Programming Languages: Python, Java, SQL, Spark SQL
Data Engineering: ETL/ELT Pipelines, Data Ingestion, Data Transformation, Data Validation
Data Warehousing & Analytics: Data Warehousing, Star & Snowflake Schemas, Fact & Dimension Tables
Data Orchestration & Workflow: Workflow Orchestration, Pipeline Scheduling, Dependency Management
Data Quality & Observability: Data Quality Checks, Data Validation Frameworks, Data Observability, SLA Monitoring
Big Data & Analytics Platforms: Apache Spark, Databricks (Notebooks, Jobs, Delta Lake)
Streaming & Messaging: Apache Kafka, Spark Structured Streaming
Databases & Storage: PostgreSQL, MongoDB, Redis, AWS RDS, Amazon S3, Delta Lake
Cloud Platforms: AWS (S3, EC2, RDS, IAM, CloudWatch, EKS), Databricks on Azure, GCP
Data Processing: Batch Processing, Real-Time Streaming, Incremental Loads
DevOps & Automation: Docker, Kubernetes (EKS), Jenkins, Terraform, AWS CodePipeline
Monitoring & Reliability: CloudWatch, Prometheus, Grafana, ELK Stack
Security & Governance: IAM, OAuth 2.0, JWT, AES-256 Encryption, Data Access Controls
Tools & Methodologies: Git, JIRA, Agile/Scrum
PROFESSIONAL EXPERIENCE
Fidelity Apr 2024 – Present
Senior Data Engineer Dallas
Architect and maintain scalable batch and real-time data pipelines on Databricks and AWS to process large-scale financial transactions and event data.
Designed and implemented distributed ETL and streaming pipelines using Apache Spark and Kafka, processing millions of financial transaction records daily and enabling real-time analytics and reporting.
Built end-to-end data pipelines to ingest data from Kafka streams, REST APIs, and relational databases, transforming and storing curated datasets in Delta Lake on AWS S3.
Designed a Lakehouse architecture on Delta Lake, implementing Bronze, Silver, and Gold layers to improve data quality, traceability, and downstream analytics reliability.
Developed high-performance Spark and PySpark transformation pipelines in Databricks, enabling analytics and machine learning teams to consume curated datasets efficiently.
Integrated Kafka with Spark Structured Streaming to enable real-time ingestion and processing of financial events for analytics and monitoring use cases (see the ingestion sketch after this role's bullets).
Optimized Spark job performance using partitioning strategies, caching, and query optimization techniques, reducing pipeline runtime by 30% (see the optimization sketch after this role's bullets).
Designed and optimized PostgreSQL and AWS RDS schemas to support analytical workloads and improve query performance.
Implemented automated data pipeline scheduling and orchestration using Databricks Jobs and CI/CD pipelines, improving deployment efficiency and reliability.
Built monitoring and alerting systems using CloudWatch, Prometheus, and Grafana to proactively detect and resolve pipeline failures.
Containerized pipeline components using Docker and deployed services on Kubernetes (EKS) for scalability and high availability.
Automated infrastructure provisioning and deployment workflows using Terraform and AWS CodePipeline.
Implemented data validation and quality checks, including schema validation and integrity checks, reducing production data issues.
Applied IAM-based access controls and secure data access policies to ensure compliance with financial data security standards.
Collaborated with data analysts, data scientists, and business teams to design analytics-ready datasets and support reporting and decision-making.
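The following is a minimal PySpark sketch of the Kafka-to-Delta Bronze ingestion pattern referenced above; the broker address, topic name, event schema, and S3 paths are illustrative placeholders, not production values.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn-bronze-ingest").getOrCreate()

# Hypothetical schema for a financial transaction event
txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw event stream from Kafka (placeholder broker and topic)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "transactions")
       .option("startingOffsets", "latest")
       .load())

# Parse the JSON payload and stamp each record with its ingestion time
bronze = (raw.select(from_json(col("value").cast("string"), txn_schema).alias("e"))
          .select("e.*")
          .withColumn("ingest_ts", current_timestamp()))

# Append into the Bronze Delta table; the checkpoint enables failure recovery
(bronze.writeStream
 .format("delta")
 .option("checkpointLocation", "s3://bucket/checkpoints/txn_bronze")
 .outputMode("append")
 .start("s3://bucket/delta/bronze/transactions"))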
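A hedged sketch of the broadcast-join, caching, and partitioned-write optimizations named above; the table paths and columns (account_id, txn_date, amount) are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, sum as spark_sum

spark = SparkSession.builder.appName("txn-optimize").getOrCreate()

txns = spark.read.format("delta").load("s3://bucket/delta/silver/transactions")  # placeholder path
accounts = spark.read.format("delta").load("s3://bucket/delta/silver/accounts")  # placeholder path

# Broadcast the small dimension table so the large fact table avoids a shuffle
enriched = txns.join(broadcast(accounts), "account_id")

# Cache the joined dataset because several downstream aggregations reuse it
enriched.cache()

daily = enriched.groupBy("txn_date").agg(spark_sum("amount").alias("total_amount"))

# Partition the output by date so queries prune files instead of scanning everything
(daily.write.format("delta")
 .mode("overwrite")
 .partitionBy("txn_date")
 .save("s3://bucket/delta/gold/daily_totals"))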
Wipro Aug 2021 – Dec 2023
Data Engineer Hyderabad
Designed and implemented scalable batch and real-time data pipelines on Databricks and Azure to process large-scale healthcare and clinical datasets for analytics, reporting, and regulatory compliance.
Architected and built end-to-end ETL pipelines using Apache Spark and Databricks to ingest, transform, and process structured and semi-structured healthcare datasets from PostgreSQL, MongoDB, and external APIs into Azure Data Lake Storage (ADLS).
Designed scalable Lakehouse architecture using Delta Lake, implementing Bronze (raw), Silver (cleaned), and Gold (aggregated) data layers to ensure data quality, traceability, and optimized analytics performance.
Developed high-performance PySpark transformation pipelines to clean, normalize, enrich, and aggregate large healthcare datasets, enabling reliable reporting and operational analytics.
Built real-time streaming data pipelines using Apache Kafka and Spark Structured Streaming to ingest clinical event data and operational system logs into Delta Lake for near real-time analytics.
Implemented incremental and batch data ingestion strategies using watermarking and change tracking concepts to efficiently process new and updated healthcare records without reprocessing entire datasets (see the incremental-merge sketch after this role's bullets).
Designed and optimized Delta Lake tables using partitioning strategies and efficient file formats (Parquet), improving query performance and reducing storage and compute costs.
Implemented data validation and quality frameworks using Spark, including schema validation, null checks, duplicate detection, and reconciliation logic to ensure data accuracy and integrity (see the validation sketch after this role's bullets).
Automated data pipeline execution using Databricks Jobs and integrated CI/CD pipelines using Jenkins and Terraform, ensuring reliable deployment across development, QA, and production environments.
Containerized Spark jobs and supporting services using Docker and deployed workloads on Kubernetes clusters to enable scalable and fault-tolerant execution.
Implemented pipeline monitoring, logging, and alerting using Prometheus, Grafana, and ELK Stack to track pipeline health, detect failures, and ensure SLA compliance.
Optimized Spark job performance using partition tuning, broadcast joins, caching, and memory optimization techniques, improving pipeline efficiency by approximately 20%.
Designed relational data schemas and optimized queries in PostgreSQL to support efficient ingestion and downstream analytical workloads.
Implemented secure data access controls using role-based permissions and encryption to ensure compliance with healthcare data security and privacy requirements.
Developed reusable and parameterized Spark jobs to standardize data ingestion and transformation across multiple healthcare data sources, improving pipeline maintainability.
Performed root cause analysis on pipeline failures and implemented robust retry, checkpointing, and error-handling mechanisms to improve pipeline reliability.
Collaborated with data analysts, reporting teams, and business stakeholders to understand data requirements and deliver analytics-ready datasets for reporting dashboards and operational insights.
Supported regulatory reporting and audit requirements by enabling reliable historical data tracking using Delta Lake time travel and versioning features.
Worked closely with DevOps and cloud teams to standardize infrastructure provisioning and deployment workflows using Terraform and Kubernetes.
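A minimal sketch of the watermark-based incremental load described above, using a Delta Lake MERGE to upsert only records changed since the last run; the paths, patient_id key, and updated_at column are hypothetical.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

# Watermark from the previous successful run (persisted externally in practice)
last_watermark = "2023-11-01 00:00:00"

# Pull only records created or updated after the watermark
updates = (spark.read.format("delta")
           .load("/mnt/datalake/bronze/patients")  # placeholder path
           .where(f"updated_at > '{last_watermark}'"))

silver = DeltaTable.forPath(spark, "/mnt/datalake/silver/patients")  # placeholder path

# Upsert: update matched records, insert new ones, leave the rest untouched
(silver.alias("t")
 .merge(updates.alias("s"), "t.patient_id = s.patient_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())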
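A minimal example of the schema, null, and duplicate checks mentioned above; the table path, required columns, and fail-fast assertions are illustrative, not the original framework.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.format("delta").load("/mnt/datalake/silver/encounters")  # placeholder path

# Schema validation: fail fast if required columns are missing
required = {"encounter_id", "patient_id", "event_time"}
missing = required - set(df.columns)
assert not missing, f"Missing required columns: {missing}"

# Null check on the primary key
null_keys = df.where(col("encounter_id").isNull()).count()
assert null_keys == 0, f"{null_keys} rows with null encounter_id"

# Duplicate detection on the primary key
dupes = (df.groupBy("encounter_id").agg(count("*").alias("n"))
         .where(col("n") > 1)
         .count())
assert dupes == 0, f"{dupes} duplicate encounter_id values"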
ACHIEVEMENTS
Key Contributor Award for optimizing Spark and Databricks ETL pipelines using partitioning, caching, and Delta Lake best practices, reducing end-to-end data processing time by 30% and improving analytics availability.
Outstanding Team Player Award for implementing data validation, monitoring, and alerting frameworks that reduced recurring data failures by 30% and significantly improved trust in analytics and reporting.
Excellence in Innovation Award for strengthening data governance and security by enforcing IAM, encryption (AES-256), and access controls across data pipelines, successfully supporting HIPAA-compliant analytics and regulatory audits.
EDUCATION
Master of Science in Applied Computer Science – Northwest Missouri State University
Bachelor of Science in Electronics and Communications Engineering – Tirumala Engineering College
CERTIFICATIONS
Databricks Certified Data Engineer Associate – Databricks
Oracle Certified Associate (OCA) – Oracle
AWS Certified Developer – Associate – Amazon Web Services