SAI RAMA KRISHNA GANGAVARAPU
California, USA | +1-925-***-**** | *********.***********@*****.*** | LinkedIn

Summary
Data Engineer with 4+ years of experience designing and building scalable, cloud-native data pipelines. Expert in leveraging Azure Databricks (Spark), Azure Data Factory (ADF), and AWS Glue to orchestrate ETL processes, optimize data lakes (Delta Lake, ADLS, S3), and ensure high data quality for analytics. Proficient in Python, PySpark, and SQL, with a proven track record of reducing processing times by 25%, cutting cloud costs by 18%, and automating workflows to drive efficiency. Adept at cross-functional collaboration in Agile environments to deliver data solutions that empower stakeholders and support business objectives.

Skills
• Cloud Platforms: Microsoft Azure (Data Factory, Databricks, Data Lake Storage Gen2, DevOps, SQL Database, Synapse), AWS (EMR, Glue, S3, Athena, Redshift, IAM), GCP (BigQuery, Cloud Composer)
• Big Data & Processing Frameworks: Apache Spark (PySpark, Spark SQL), Delta Lake, Hadoop
• Programming & Scripting Languages: Python (Pandas, PyTest), SQL (Query Optimization, Tuning)
• Data Warehousing & Databases: Azure SQL Database, Google BigQuery, AWS Redshift, AWS Athena
• ETL/ELT & Data Pipeline Orchestration: Azure Data Factory (ADF), AWS Glue, Apache Spark, Data Ingestion Frameworks
• Data Governance & Quality: Data Validation, Anomaly Detection, Data Quality Checks, AWS Glue Data Catalog
• CI/CD & DevOps: Azure DevOps, CI/CD Pipelines, Automated Deployment, Infrastructure Management
• BI & Visualization Tools: Power BI, Dashboard Development, Reporting
• Data Optimization & Performance: Performance Tuning, Z-Ordering, OPTIMIZE, Liquid Clustering, Cost Optimization
• Development Practices: Agile/Scrum Methodology, Code Review, Technical Documentation, Peer Collaboration
Professional Experience

Data Engineer 08/2024 to Current
MetLife, USA
• Engineered a real-time data ingestion pipeline using Apache Kafka and AWS Kinesis to process 2 million daily policy transactions, reducing data latency from hours to under 60 seconds for actuarial teams.
• Architected a Delta Lake lakehouse on Databricks leveraging Unity Catalog, enabling schema enforcement and time travel for 15 TB of historical claims data, improving data reliability for analytics by 70%.
• Developed automated data quality frameworks using Great Expectations within Airflow DAGs, identifying and resolving 500+ data anomalies monthly across critical financial datasets.
• Migrated legacy on-premises SQL Server ETL processes to Azure Data Factory and Synapse Analytics, reducing data processing costs by $12,000 per month.
• Optimized complex PySpark jobs by implementing adaptive query execution and predicate pushdown (sketched below), cutting average job runtime from 45 minutes to 20 minutes.
• Implemented a CI/CD pipeline for Databricks notebooks using Azure DevOps, reducing deployment time for new code from 3 days to 4 hours and ensuring environment consistency.
• Built a scalable feature store for ML models in Azure Machine Learning, accelerating model training cycles by 30% for the data science team.
• Automated the provisioning of data infrastructure using Terraform modules, standardizing deployments across 3 environments and reducing setup time from 2 days to 3 hours.
• Designed and deployed a company-wide data catalog with Azure Purview, automating data lineage discovery for over 500 datasets and improving governance compliance.
• Orchestrated end-to-end data pipelines with Apache Airflow, managing dependencies for 50+ daily jobs and improving overall workflow reliability by 90%.
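A minimal sketch of the adaptive-query-execution and predicate-pushdown pattern from this role, assuming Spark 3.x with the Delta Lake connector; all paths, table names, and columns are hypothetical, not MetLife's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("claims-join-sketch")
    # Adaptive query execution re-plans joins and shuffle partition counts
    # at runtime based on observed statistics.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

# Filtering before the join lets Spark push the predicate down into the
# Delta/Parquet scan, so files that cannot match are never read.
claims = (
    spark.read.format("delta").load("/mnt/lake/claims")  # hypothetical path
    .filter(F.col("claim_date") >= "2024-01-01")
)
policies = spark.read.format("delta").load("/mnt/lake/policies")

summary = claims.join(policies, "policy_id").groupBy("status").count()
summary.show()
```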
Data Engineer 05/2020 to 07/2022
Paytm, India
• Processed 5M+ daily payment transactions using PySpark on AWS EMR to structure raw JSON data, reducing null values for the fraud analytics team.
• Automated data quality checks with Python (Pandas) to validate 200+ payment gateway files daily (sketched below), cutting manual verification time from 8 hours to 45 minutes for finance operations.
• Migrated 3 legacy Hadoop jobs to AWS Glue to transform merchant settlement data, improving processing speed by 2.5x for daily payout reports.
• Developed a Power BI dashboard to monitor 15+ ETL pipelines, reducing incident detection time from 3 hours to 45 minutes for the DevOps team.
• Standardized payment reconciliation using AWS Athena to enable accurate matching of 200M+ daily transactions for the accounting department.
• Optimized Redshift SQL queries to accelerate customer segmentation reports, decreasing runtime from 2 hours to 30 minutes for the marketing team.
• Created Python automation scripts to generate weekly SLA reports, saving 6 hours of manual work for senior management reviews.
• Designed a data validation framework with PyTest to catch 95% of anomalies pre-production, preventing 10+ monthly pipeline failures.
• Assisted in building a data catalog in AWS Glue to document 50+ critical datasets, enabling self-service analytics for 20+ business users.
• Configured S3 lifecycle policies to automatically archive 12TB of payment logs, saving $3,000/year in storage costs.
• Transformed 10+ payment gateway schemas using PySpark to enable unified reporting, reducing reconciliation discrepancies by 30%.
• Collaborated on a CI/CD pipeline with AWS CodePipeline to deploy ETL jobs, reducing deployment errors by 70%.
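A minimal sketch of the Pandas-based gateway-file checks referenced above; the file layout, column names, and individual checks are hypothetical stand-ins:

```python
import pandas as pd

REQUIRED_COLUMNS = {"txn_id", "merchant_id", "amount", "settled_at"}

def validate_gateway_file(path: str) -> list:
    """Return a list of human-readable issues found in one gateway file."""
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # Fail fast: the remaining checks assume these columns exist.
        return [f"missing columns: {sorted(missing)}"]
    issues = []
    if df["txn_id"].duplicated().any():
        issues.append("duplicate transaction ids")
    if (df["amount"] <= 0).any():
        issues.append("non-positive amounts")
    if df["settled_at"].isna().any():
        issues.append("rows with null settlement timestamps")
    return issues
```

A thin PyTest layer can then assert validate_gateway_file(path) == [] for each incoming file, which is the shape of the pre-production anomaly gate described in the PyTest bullet.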
Associate Data Engineer 05/2019 to 04/2020
Paytm, India
• Developed and optimized high-volume data pipelines using Apache Spark (PySpark) and SQL to support analytics and machine learning.
• Managed and processed large datasets in distributed computing environments using Hadoop ecosystem tools like Hive and HDFS.
• Built and maintained data warehouse solutions on AWS Redshift, enabling efficient storage and fast query performance for business intelligence.
• Created and scheduled automated data workflows using Apache Airflow (sketched below) for reliable and monitorable data processing tasks.
• Ensured data reliability and performed data validation and profiling to maintain high data quality standards across all datasets.
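A minimal sketch of the Airflow scheduling described above, assuming Airflow 2.x; the DAG id, task id, and callable are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def profile_and_validate():
    # Placeholder for the profiling/validation logic run against daily loads.
    print("running data validation and profiling")

with DAG(
    dag_id="daily_payments_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(
        task_id="validate_daily_loads",
        python_callable=profile_and_validate,
    )
```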
Education
Master of Science: Computer Science 07/2024
California State University, East Bay, United States

Projects
Academic Performance Prediction (AugmentED)
• Developed a multi-model ML system (LSTM, XGBoost) predicting student performance from behavioral data across 7 domain combinations (sketched below).
• Automated end-to-end ML pipelines with Azure Machine Learning, enabling efficient hyperparameter tuning and model selection.
• Deployed a scalable REST API with Azure Container Instances and delivered interactive Power BI dashboards for real-time academic risk profiling.
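A minimal sketch of the XGBoost arm of the system above, trained on synthetic stand-in data; the features, targets, and hyperparameters are placeholders, not the project's actual configuration:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # stand-in behavioral features
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)  # stand-in scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```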
NYC Taxi Data Pipeline - Delta Lake & Databricks
• Built a medallion-architecture pipeline processing large-scale taxi data with schema validation and Delta Lake time travel.
• Optimized performance using ZORDER, VACUUM, and dynamic partitioning (sketched below), reducing query latency and storage costs.
• Enabled seamless data access via Unity Catalog and created Databricks SQL dashboards visualizing trip trends and demand patterns.
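A minimal sketch of the Delta table maintenance described above, assuming a Databricks workspace where the table is already registered; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Co-locate rows with similar pickup_zone values so zone-filtered queries
# can skip unrelated files (data skipping via Z-ordering).
spark.sql("OPTIMIZE gold.taxi_trips ZORDER BY (pickup_zone)")

# Delete files no longer referenced by the transaction log, keeping 7 days
# (168 hours) of history so recent time travel still works.
spark.sql("VACUUM gold.taxi_trips RETAIN 168 HOURS")
```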
Superstore Sales Analysis Dashboard
• Designed an interactive sales dashboard with drill-down capabilities across regions, categories, and time periods.
• Developed an optimized data model using DAX measures, improving report responsiveness and load time.
• Published automated reports via Power BI Service, enabling stakeholder access to real-time sales insights.

Certification
• Microsoft Certified: Fabric Data Engineer Associate