HEMA B
Sunnyvale, CA ***** (***)-(***)-**** *****.*.****@*****.*** linkedin.com/in/hema-data
PROFESSIONAL PROFILE
Senior Data Engineer with 8 years of experience building Python ETL pipelines and content migration solutions on AWS and GCP. Designed and optimized AWS Glue and PySpark jobs to migrate and transform complex data into Redshift, improving performance and reliability. Developed Python-based Dataflow pipelines and automation scripts for data validation and monitoring.
PROFESSIONAL EXPERIENCE
Anthem Oct 2024 - Present
Senior Data Engineer Sunnyvale, CA
• Developed and optimized BigQuery datasets, tables, views, and SQL procedures to support analytics and executive reporting; implemented partitioning and clustering on date and key columns to reduce query costs and improve performance
• Built and maintained Cloud Composer (Airflow) DAGs to orchestrate daily batch data loads from multiple on-prem and cloud source systems into BigQuery with dependency management and failure alerting, ensuring on-time data availability for downstream analytics
• Developed Python-based Dataflow (Apache Beam) pipelines for large-scale data ingestion and content migration from Teradata source systems into GCP, ensuring reliable and scalable batch processing.
• Implemented data validation and quality checks using Python scripts post-load; monitored pipeline health via Cloud Monitoring dashboards and resolved production issues with root-cause documentation
• Built infrastructure for optimal extraction, transformation, and loading from multiple source systems with partitioned BigQuery tables for performance and cost optimization
• Implemented data validation, reconciliation, and anomaly detection scripts in Python to verify pipeline outputs against source-system metrics post-load
Netenrich Technologies Nov 2019 - Jun 2022
Data Engineer Hyderabad, India
• Built AWS Glue ETL jobs using Spark SQL to process banking transaction data extracted from mainframe systems, delivering reliable daily feeds to Redshift
• Built AWS Glue PySpark ETL pipelines to process banking transaction data from mainframe sources into Redshift, implementing incremental load patterns using job bookmarks to process only new or changed records.
• Created Redshift tables with appropriate distribution and sort keys, which reduced query runtime for common analytical reports
• Wrote SQL validation scripts to compare record counts, sums, and key metrics between source and target systems, ensuring data accuracy before loading into Redshift
• Designed and built data integration pipelines that ingested banking transaction data from mainframe and API sources into AWS (S3, Glue, Redshift), reducing data latency to under 2 hours and enabling timely downstream analytics.
• Created AWS Lambda functions in Python to automatically trigger ETL jobs via CloudWatch events when new files arrived in S3 buckets, ensuring timely and reliable data ingestion.
• Mentored junior engineers on modular, reusable code practices and CI/CD workflows using GitHub and Jenkins to improve team development velocity
• Built reusable AWS Glue PySpark scripts and Step Functions workflows for standardized multi-step ETL orchestration, including error handling and retry logic.
• Supported production workloads with structured troubleshooting, root-cause analysis, and enhancements to data quality checks and job reliability.
• Developed Python scripts for data process maintenance, file validation, monitoring, and automated notifications to support production reliability
• Implemented incremental load patterns (Glue job bookmarks) and workload management strategies to optimize pipeline performance and reduce processing time
• Wrote SQL validation and reconciliation scripts to verify record counts, sums, and key metrics between source and target systems
• Trained two junior team members on AWS service fundamentals and PySpark development, enabling them to build and deploy data pipelines independently and accelerating project delivery.
Mtouch Labs Aug 2016 - Nov 2019
Data Engineer Hyderabad, India
• Developed Informatica PowerCenter workflows to extract data from Oracle and SQL Server source systems, delivering reliable source feeds that enabled downstream analytics pipelines
• Defined and developed ETL solutions with Informatica PowerCenter to integrate Oracle and SQL Server on-premises data into a SQL Server data warehouse, reducing data load time by streamlining the pipeline
• Wrote complex SQL queries in Oracle and SQL Server to validate data, reconcile differences between systems, and generate audit reports, which improved data accuracy and strengthened compliance monitoring
• Built Python scripts employing pandas and smtplib to automate manual data quality checks, file processing, and notification tasks, eliminating manual effort and ensuring timely error alerts
• Designed star-schema data models (dimension and fact tables) for healthcare and manufacturing clients, enabling faster report generation and more effective analytics, which improved decision-making speed
• Created detailed technical documentation including source-to-target mappings, transformation rules, and data dictionaries
• Performed unit testing, data validation, and verification procedures to ensure delivered solutions met functional and non-functional requirements before production deployment
• Worked with business users to understand data requirements and translate them into technical specifications
• Managed code deployments across development, test, and production environments following change control and DevOps governance procedures
• Created Bash shell scripts on Linux to automate file transfers, schedule jobs with cron, and replace Informatica processes, reducing manual effort and processing time
• Set up basic monitoring for ETL jobs using email alerts and log file checks, which enabled early detection of failures and data issues and reduced downtime
• Assisted in migrating client data to new Oracle and AWS Glue platforms, using Python scripts and Spark SQL to transform and load data, ensuring the migration finished on schedule with no data loss
• Maintained version-controlled change logs and update history for ETL mappings and workflow configurations in Git, enabling quick rollback of changes and improving auditability for the team
EDUCATION
Rajamahendri Institute of Engineering and Technology Jun 2012 - May 2016
Bachelor of Technology, Computer Science India
• GPA: 3.87
Gannon University Aug 2022 - May 2024
Master's degree in Artificial Intelligence USA
• GPA: 3.94
TECHNICAL SKILLS
• Cloud Platforms: GCP (BigQuery, Dataproc, Cloud Composer, GCS, Cloud Run, GKE, Cloud Functions, Bigtable), AWS (Glue, Redshift, S3, Lambda, Step Functions)
• Big Data Tools: PySpark, Spark SQL, Airflow, Informatica PowerCenter, Spark, Content Migration
• Programming: Python, SQL, Shell Scripting
• Databases: PostgreSQL, MySQL, Oracle, SQL Server, BigQuery, Redshift
• DevOps & Tools: Git, Jenkins, Docker, Jira, Linux, Kubernetes, CI/CD (Cloud Build/GitHub Actions)
CERTIFICATIONS
• Google Cloud Professional Data Engineer
• AWS Certified Data Analytics – Specialty