Sai Teja M
Technology: Data Engineer
Total Experience: 5+ Years
Email ID: *********.*@*****.*** Phone Number: 804-***-****
PROFESSIONAL SUMMARY
• Data Engineer with over 5 years of experience building cloud-native data solutions across AWS, Azure, and GCP. Skilled in designing data lakes, warehouses, and real-time streaming pipelines using tools such as Spark (Scala/PySpark), Kafka, Snowflake, Databricks, and BigQuery.
• Experienced in ETL development and orchestration using Airflow, NiFi, ADF, and Informatica, integrating diverse data from APIs, flat files, and RDBMS sources. Skilled in real-time processing with Kafka, Kinesis, Event Hubs, and Spark Streaming for actionable analytics.
• Built secure data platforms with IAM, KMS encryption, masking, and HIPAA/GDPR compliance. Managed CI/CD pipelines using GitHub, Bitbucket, GitLab, Jenkins, and Azure DevOps. Proficient in Docker and Kubernetes for containerized data workloads.
• Collaborated with cross-functional teams to deliver ML feature pipelines, BI-ready datasets, and automated data quality checks using PyTest, ScalaTest, Deequ, and Great Expectations. Supported real-time analytics and reporting via Power BI, Looker, Tableau, and Athena.
• Optimized large-scale data processing on AWS EMR, Databricks, and GCP Dataproc while maintaining cost efficiency, performance, and data security standards. Passionate about building reliable, scalable, and production-grade data systems that drive business insights.
TECHNICAL SKILLS
• Programming Languages: Python, Java, R, SQL, Scala, Shell Scripting
• Big Data Tools: Apache Spark, Hadoop, Hive, Pig, HDFS, Sqoop, Kafka, Flume
• ETL & Workflow Orchestration: Apache Airflow, AWS Glue, Informatica, Apache NiFi, SSIS
• Cloud Platforms: AWS (S3, Glue, Redshift, Lambda, EMR, Athena), Azure, Google Cloud Platform (GCP)
• Databases: Oracle, PostgreSQL, SQL Server, MongoDB, Snowflake
• Streaming Technologies: Apache Kafka, Spark Streaming, AWS Kinesis
• Data Visualization: Power BI, Tableau, Looker
• DevOps & Infrastructure: Git, Bitbucket, Jenkins, Terraform, Docker
• Other Tools & Frameworks: Databricks, Delta Lake, YARN, Confluent Schema Registry, JIRA, Confluence
PROFESSIONAL EXPERIENCE
Liberty Mutual Insurance
July 2024 - Present
Role: Data Engineer
• Built AWS S3 data lake using Parquet/Avro formats with schema standardization for claim and policy data.
• Developed ETL pipelines in Spark (Scala) on Databricks, integrating data from JSON, CSV, APIs, and Oracle.
• Migrated ETL from Informatica to Spark, reducing cost and improving scalability.
• Enabled real-time policy updates with Kafka to Snowflake streaming pipelines for near real-time analytics.
• Created Snowflake star schema models, secured with row-level access, column masking, and GDPR compliance.
• Automated workflows using Airflow DAGs with retries, SNS alerts, and checkpointing (see the sketch after this list).
• Built unit tests with ScalaTest, validated data transformations, and logged pipelines via Log4j and CloudWatch.
• Integrated external APIs (weather, vehicle) for enriched risk models.
• Deployed Dockerized pipelines on Kubernetes, managed CI/CD using GitHub, GitFlow, Jenkins, and collaborated in Agile (Jira).
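The Airflow automation above can be summarized in a minimal sketch: a daily DAG whose tasks retry on transient failures and publish an SNS alert once retries are exhausted. The DAG name, task, topic ARN, and transformation step are illustrative placeholders, not the production pipeline.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.hooks.sns import SnsHook


    def notify_failure(context):
        # Publish a failure alert to SNS with the failed task's identifiers.
        SnsHook(aws_conn_id="aws_default").publish_to_target(
            target_arn="arn:aws:sns:us-east-1:000000000000:data-alerts",  # placeholder ARN
            subject="Airflow task failed",
            message=f"{context['task_instance'].task_id} failed in {context['dag'].dag_id}",
        )


    def run_claims_transform(**_):
        # Placeholder for the Spark/Databricks transformation step.
        print("Submitting claims transformation job")


    with DAG(
        dag_id="claims_lake_ingest",                     # illustrative DAG name
        start_date=datetime(2024, 7, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={
            "retries": 3,                                # retry transient failures
            "retry_delay": timedelta(minutes=10),
            "on_failure_callback": notify_failure,       # SNS alert when retries are exhausted
        },
    ) as dag:
        PythonOperator(task_id="transform_claims", python_callable=run_claims_transform)

Putting retries and the failure callback in default_args keeps every task in the DAG covered without per-task configuration.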
Key Bank
June 2021 - July 2023
Role: Data Engineer
• Built a data lake on Azure ADLS for loan and transaction data using Parquet/ORC, structured for efficient access.
• Developed ETL pipelines in PySpark (Databricks), integrating CSV, JSON, Avro, SQL Server, and API data with schema standardization (see the sketch after this list).
• Enabled real-time transaction ingestion via Event Hubs & Structured Streaming, pushing enriched data to consumers.
• Migrated Teradata to Snowflake, rewriting logic with Snowflake SQL/UDFs and building Star Schema models for OLAP workloads.
• Ensured data security with Azure RBAC, Snowflake masking, KMS encryption, and row-level security for compliance.
• Automated workflows via ADF, logged pipelines in Azure Log Analytics, and enforced data quality with PyTest.
• Published curated data using Hive SQL and Power BI, enabled self-service Snowflake querying for analysts.
• Managed CI/CD with Bitbucket and Azure DevOps, captured lineage and metadata for governance, and created ML feature sets in Snowflake.
• Worked in Agile sprints via Jira, collaborating with InfoSec, risk, and finance teams.
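A minimal PySpark sketch of the schema-standardized batch ingestion referenced above: an explicit schema is enforced on raw CSV loan extracts and the result is landed as partitioned Parquet in ADLS. The storage account, container, and column names are placeholders, not the actual bank layout.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType, DecimalType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("loan_ingest").getOrCreate()

    # Standardized schema applied at read time so malformed records surface early.
    loan_schema = StructType([
        StructField("loan_id", StringType(), nullable=False),
        StructField("customer_id", StringType(), nullable=False),
        StructField("principal_amount", DecimalType(18, 2)),
        StructField("origination_date", DateType()),
    ])

    raw = (
        spark.read
        .option("header", "true")
        .schema(loan_schema)
        .csv("abfss://raw@examplelake.dfs.core.windows.net/loans/")  # placeholder path
    )

    curated = raw.withColumn("ingest_date", F.current_date())

    # Partitioned Parquet in the curated zone keeps downstream reads selective.
    (
        curated.write
        .mode("overwrite")
        .partitionBy("ingest_date")
        .parquet("abfss://curated@examplelake.dfs.core.windows.net/loans/")
    )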
UnitedHealth Group (UHG)
June 2019 - May 2021
Role: Data Engineer
• Built batch and real-time pipelines in Spark (Scala) on GCP Dataproc, processing claims and pharmacy data into a GCS data lake.
• Designed AVRO/JSON/CSV ingestion, exposed BigQuery data via Cloud Functions APIs, and integrated Kafka streaming for EDI feeds.
• Migrated ETL from SQL Server/Oracle to BigQuery, optimized pipelines, and developed Star Schema data marts that cut load times by 70% (see the sketch after this list).
• Implemented data quality checks (Deequ), exception handling, and lineage capture for 100+ pipelines.
• Secured data with IAM roles, KMS encryption, and masking, ensuring HIPAA/GDPR compliance.
• Deployed Dockerized Spark jobs via Cloud Composer (Airflow) and built ML feature stores in BigQuery.
• Published datasets to Looker/Data Studio, logged pipelines via Log4j/Stackdriver, and managed CI/CD with GitLab.
• Led Agile delivery via Jira, supporting the on-prem-to-GCP migration and collaborating with cross-functional teams.
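An illustrative sketch of the GCS-to-BigQuery load step behind the migration above, using the google-cloud-bigquery client: curated Parquet files are appended into a date-partitioned fact table. The project, dataset, table, bucket, and partition column are placeholders, not UHG resources.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")  # placeholder project

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="service_date"),  # placeholder partition column
    )

    load_job = client.load_table_from_uri(
        "gs://example-claims-lake/curated/claims/*.parquet",  # placeholder GCS path
        "example-project.claims_mart.fact_claims",            # placeholder fact table
        job_config=job_config,
    )
    load_job.result()  # block until the load finishes, raising on failure

    table = client.get_table("example-project.claims_mart.fact_claims")
    print(f"Loaded table now has {table.num_rows} rows")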
EDUCATION
Old Dominion University
Master of Science in Computer Science