Hema Venkata Sai Dumpala
Data Engineer
Milpitas, CA 408-***-**** ************@*****.*** LinkedIn GitHub LeetCode
SUMMARY
Results-driven Data Engineer with 5 years of end-to-end data analytics and engineering experience, including over 2 years in the healthcare domain, delivering scalable and secure data solutions across cloud platforms. Skilled in building and optimizing ETL pipelines, real-time data workflows, and data lakes using AWS (Glue, S3, Redshift, Lambda, Athena) and Microsoft Azure (Data Factory, Data Lake, Synapse, Blob Storage). Proficient in Python, PySpark, Scala, Spark SQL, and R for large-scale data processing, cleansing, and transformation, with expertise in designing modular, high-performance data architectures and implementing data governance best practices. Experienced in SQL optimization, Tableau-based data visualization, and cross-functional collaboration in Agile environments. Holds a Master's degree in Data Science with a strong foundation in cloud-native architecture and healthcare data compliance (HIPAA), and adept at translating complex datasets into actionable business insights.
EDUCATION
San Jose State University, CA, USA Jan 2023 – Dec 2024
Master of Science in Data Science
Acharya Nagarjuna University, India Jun 2016 – Apr 2019
B.Tech in Computer Science and Engineering
TECHNICAL SKILLS
Programming Languages: C, Java (J2EE), SQL, Pig Latin, HiveQL, Scala, Python, UNIX Shell Scripting.
Databases: Microsoft SQL Server, Oracle, Microsoft Access, MySQL, Teradata, PostgreSQL, DB2.
Big Data Technologies: YARN, MapReduce, Pig, Hive, HBase, Cassandra, Oozie, Apache Spark, Impala, Kafka.
Hadoop Distributions & Cloud Platforms: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP; AWS (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS); Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory); IBM Cloud.
NoSQL Databases: Cassandra, MongoDB.
Reporting Tools/ETL Tools: Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI.
Methodologies: Agile/Scrum, Waterfall.
Development Tools: Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office Suite (Word, Excel, PowerPoint, Access).
Others: Machine Learning, NLP, StreamSets, Spring Boot, Jupyter Notebook, Docker, Kubernetes, Jenkins, Jira.
PROFESSIONAL EXPERIENCE
Blue Cross Blue Shield, CA Feb 2025 – Present
Data Engineer
Built scalable ETL pipelines using Apache Spark on AWS Glue to process and transform large volumes of healthcare claims and member data from diverse RDBMS and semi-structured sources into Amazon S3 and Snowflake.
Performed advanced data cleansing and transformation with PySpark, improving the accuracy of patient demographics, provider directories, and claims records for downstream analytics.
Integrated Snowflake with AWS S3 using Snowpipe for real-time data ingestion, enabling up-to-date access to policy and enrollment records across actuarial and compliance teams.
Developed and orchestrated robust ETL workflows using AWS Glue Workflows, Crawlers, and Lambda, ensuring timely ingestion of high-volume datasets across environments.
Created Tableau dashboards connected to Snowflake and Amazon Redshift, delivering real-time insights on claim approvals, cost trends, and care quality for over 10 business units.
Automated error handling and audit logging using Step Functions and CloudWatch, reducing manual monitoring effort and improving pipeline observability.
Designed HIPAA-compliant data architectures with AWS IAM, KMS encryption, and S3 access controls, ensuring secure storage and controlled access to PII and PHI data.
Tuned Spark jobs to reduce ETL runtime by 30%, supporting faster dashboard refresh cycles and more responsive data access for business stakeholders.
Accenture Pvt. Ltd., Hyderabad, India Nov 2021 – Jan 2023
Data Engineer
Developed raw, intermediate, and master tables using Spark and complex SQL joins, transforming over 1 TB of data from Azure Data Lake Storage and on-prem Hive to support a new market rollout in the Nordic region.
Built and deployed a machine learning model in Jupyter Notebook using Scikit-learn to predict insurance defaults with 15% improved accuracy, integrating results into customer master tables for executive dashboards.
Created reusable Spark application artifacts with JFrog Artifactory, orchestrated via Apache Airflow, ensuring dependency resolution and consistent job execution in daily pipelines.
Optimized Spark jobs by tuning partitioning and caching strategies, reducing job execution time by 40%, which directly improved Power BI dashboard refresh rates and reporting timeliness.
Designed and maintained data models in Snowflake, supporting analytics and BI workloads with scalable, high-performance data storage for structured and semi-structured data.
Built automated shell scripts to manage schema refresh across DEV, TEST, and PROD environments, ensuring clean pre-loads and reliable Spark job runs for accurate daily reporting.
Authored and optimized complex SQL queries in PostgreSQL for data profiling and validation, reducing data quality issues by 30% and improving dashboard reliability.
Developed and published interactive Power BI dashboards for stakeholders across Scania and insurance business units, enabling insights into sales trends, policy coverage, and risk segmentation.
Leveraged Azure Synapse Analytics and Azure Data Factory (ADF) to orchestrate ETL pipelines, enabling seamless data ingestion, transformation, and load into Snowflake and downstream systems.
Integrated Azure Blob Storage and Azure Key Vault for secure handling of sensitive configuration files and credentials, enhancing compliance with enterprise-grade data governance standards.
Participated in setting up CI/CD pipelines using Azure DevOps, ensuring automated deployment and monitoring of Spark-based data applications across environments.
Collaborated with data architects to assess and modernize legacy workflows by migrating batch processes to Azure-based distributed processing pipelines, reducing operational overhead and processing latency.
Virtusa Pvt. Ltd., Chennai, India Jun 2019 – Oct 2021
Data Engineer
Designed and developed batch and stream processing applications using Apache Spark (SparkSQL, Spark Streaming) and Kafka, enabling real-time ingestion and transformation of FHIR and claims data on IBM Cloud.
Built scalable ETL pipelines using Apache NiFi, Scala, and Java to automate ingestion and transformation of semi-structured healthcare data into Delta Lake, supporting analytics for policy, claims, and provider networks.
Performed large-scale data integration and orchestration on IBM Cloud Object Storage and IBM DataStage, ensuring consistent throughput and low latency for high-volume clinical and financial data.
Developed custom Go-based data ingestion services integrated with NiFi pipelines, improving throughput by 10% for streaming datasets sourced from hospital and insurer APIs.
Conducted predictive modeling using Python and PyTorch to detect anomalous claim patterns, enhancing fraud detection capabilities and reducing false positives by 10%.
Built Tableau dashboards to monitor KPIs including claim status, approval rates, and customer churn, contributing to a 15% increase in policy renewals through data-driven insights.
Optimized SQL queries, created stored procedures, views, indexes, and triggers for high-volume claims databases, reducing data retrieval times by 40%.
Transformed terabytes of healthcare data from IBM Cloud into optimized Hive external tables using SparkSQL, improving query performance and data availability for downstream users.
Led ETL testing, validation, and deployment across IBM Cloud Pak for Data infrastructure, ensuring compliance with healthcare data standards and operational efficiency.
Managed version control using Git and participated in Agile development cycles via JIRA, supporting iterative delivery of features aligned with IBM Healthcare client requirements.
Collaborated in architectural reviews for IBM Cloud–hosted analytics workflows, delivering performance recommendations that enhanced system scalability and reduced processing costs.
KPMG, India Sep 2018 – May 2019
Data Analyst
Conducted financial data analysis across 50+ quarterly reports using SQL and Excel, supporting revenue forecasting and budgeting with a 95% prediction accuracy, contributing to investment planning and profitability reviews.
Developed and maintained financial models for scenario analysis related to mergers, acquisitions, and expansion projects, improving strategic decision-making for 3 major initiatives.
Collaborated with cross-functional teams during the annual budgeting process, utilizing advanced statistical models in Excel and interactive Tableau dashboards to enhance financial visibility and improve budget accuracy by 15% across 5 departments.
Performed risk analysis on 20+ investment portfolios annually using Python (Pandas, NumPy) and financial modeling techniques, reducing potential financial exposure by 30% through proactive insights.
Created and managed KPI dashboards for Revenue Cycle Management (RCM), analyzing trends across 20+ financial metrics using Power BI, improving revenue visibility and cycle performance tracking.
Ensured 100% regulatory compliance in data handling and reporting while maintaining a financial database with 1,000+ entries, increasing data reliability and audit readiness by 20%.
Conducted market and competitor research using Tableau, internal datasets, and third-party financial sources, delivering insights that supported the identification of 5+ new market opportunities.
PROJECTS
Enhancing Historical Insight
Developed an AI chatbot for 19th-century US history using LLMs (Llama 2, Llama 3.1, Mistral 7B, GPT-3.5) with RAG and QLoRA fine-tuning for text summarization and Q&A.
High-Performance Data Analytics Platform Development on Snowflake
Developed and launched a data analytics platform on the Snowflake Cloud Data Warehouse, managing over 10 TB of data. Crafted and maintained ETL pipelines that processed over 1 million records daily, ensuring data accuracy and accessibility.
California Social Determinants of Health (SDOH) Data Analysis
Developed a scalable data pipeline using PySpark and AWS EMR to clean, transform, and analyze the California SDOH dataset. Integrated multiple datasets (SDOH, HPSA, AQI, County Health Rankings) to derive actionable insights, supporting policymakers and healthcare professionals. Built interactive dashboards using Tableau for visualizing health trends and disparities.
Image Caption Generator
Built a multimodal image-captioning model using CNNs for visual feature extraction and LSTMs with text embeddings (GloVe, ELMo) for caption generation, seamlessly integrating visual and textual representations.
CERTIFICATIONS
Microsoft Azure Fundamentals (AZ-900)
Oracle Generative AI Professional (2024)
AWS Certified Data Engineer – Associate
Google Cloud Associate Data Practitioner
MongoDB Associate Developer (in progress)