Varun Kumar Atkuri
Email: ******************@*****.***
Phone: 409-***-****, Location: Dallas, Texas
LinkedIn: https://www.linkedin.com/in/varunkumaratkuri/
PROFESSIONAL SUMMARY:
Results-oriented Data Engineer with 5+ years of experience building scalable, cloud-native data pipelines across the Healthcare and Retail domains. Proficient in the modern data stack, specializing in Snowflake (Snowpark), dbt, Python, and the AWS ecosystem (S3, EMR, Glue, Kinesis). Hands-on experience developing modular data models, implementing incremental ELT workflows, and supporting analytics and Machine Learning teams with high-quality, HIPAA-compliant datasets. Strong focus on pipeline performance, cost optimization, and data governance best practices in Agile environments.
PROFESSIONAL EXPERIENCE:
McKesson | Data Engineer | USA, Remote | Aug 2024 – Present
Project: Enterprise Healthcare Data Lake and Analytics Platform (AWS Cloud)
Architected and built layered dbt models in Snowflake, designing modular data marts (Facts/Dimensions) to support high-performance analytics and downstream ML pipelines.
Engineered point-in-time feature tables (using "as-of" joins) to ensure historical data accuracy for the Data Science team's disease prediction models.
Implemented incremental dbt strategies for TB-scale fact tables, utilizing surrogate keys and watermark columns to reduce daily compute costs by 18%.
Built robust ingestion pipelines using Snowpipe, automatically loading raw JSON/CSV data from Amazon S3 into Snowflake staging tables with near real-time latency.
Extended dbt transformations with Snowpark Python UDFs to parse complex HL7 and ICD/CPT codes, enabling efficient in-warehouse data cleaning.
Orchestrated end-to-end data workflows in Apache Airflow, using sensors to validate data availability before triggering dbt transformations to meet the 15-minute freshness SLA.
Established CI/CD pipelines using GitHub Actions, automating SQL linting (SQLFluff), dbt model testing, and production deployments to ensure code quality.
Automated data integration with AWS SageMaker, exporting cleaned datasets to S3 for model training and ingesting inference results back into Snowflake.
Designed and deployed interactive Tableau dashboards connected to Snowflake Data Marts, visualizing patient risk scores and operational KPIs to enable clinical decision-making.
Implemented Data Quality (DQ) safeguards using dbt tests (unique, not_null) and custom SQL checks, creating quarantine flows that reduced data defects by 12%.
Managed Infrastructure as Code (IaC) using Terraform to provision AWS resources (S3, IAM, Glue) and Snowflake objects, ensuring environment consistency.
Applied HIPAA-aligned security controls, implementing Snowflake RBAC, Dynamic Masking policies, and AWS KMS encryption to protect sensitive PII/PHI data.
Optimized Snowflake performance by configuring Clustering Keys and analyzing Query Profiles, improving retrieval speeds on large transaction tables by 40%.
Configured Observability using Amazon CloudWatch and Snowflake Resource Monitors to track pipeline failures and credit usage, preventing budget overruns.
Tech Stack: Python, SQL, dbt, Snowflake (Snowpipe, Snowpark, Tasks), AWS (S3, Glue, IAM, KMS, CloudWatch), Apache Airflow, GitHub Actions, Terraform, Tableau, Git.
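Illustrative sketch (not from the original platform): a minimal Airflow DAG showing the sensor-gated pattern used in this role, where raw file availability in S3 is confirmed before the dbt transformations run against the freshness SLA. The DAG id, bucket name, key pattern, and dbt project path are hypothetical.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Hypothetical defaults; retries absorb transient S3/Snowflake hiccups.
default_args = {"retries": 2, "retry_delay": timedelta(minutes=2)}

with DAG(
    dag_id="healthcare_elt_freshness",          # hypothetical name
    start_date=datetime(2024, 8, 1),
    schedule="*/15 * * * *",                    # aligned with a 15-minute freshness SLA
    catchup=False,
    default_args=default_args,
) as dag:
    # Wait until the latest raw extract has landed in S3 before transforming.
    wait_for_raw_file = S3KeySensor(
        task_id="wait_for_raw_file",
        bucket_name="raw-healthcare-landing",   # hypothetical bucket
        bucket_key="claims/{{ ds_nodash }}/*.json",
        wildcard_match=True,
        poke_interval=60,
        timeout=600,
        aws_conn_id="aws_default",
    )

    # Run the dbt models and tests only once the data is confirmed available.
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="cd /opt/dbt/healthcare && dbt build --target prod",
    )

    wait_for_raw_file >> run_dbt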
Nisum Technologies | Data Engineer | Hyderabad, India | Nov 2021 – Aug 2023
Project: Medical Claims Processing & Analytics Platform (AWS)
Developed and maintained scalable batch ETL pipelines for processing high-volume Medical Claims data, collaborating with architects to ensure accurate financial reporting.
Built raw data ingestion workflows, loading daily claims dumps and member eligibility files from SFTP servers into Amazon S3 (Bronze Layer) using Python scripts and AWS Lambda triggers.
Wrote complex PySpark transformations on Amazon EMR to validate claim headers, standardize provider NPIs, and compute "allowed vs. billed" amounts, processing 20M+ records per batch.
Migrated legacy SQL logic to Spark SQL on EMR, transforming raw data into optimized Parquet format (Silver Layer) on S3 to handle schema evolution and improve query performance.
Contributed to real-time data streaming using Amazon Kinesis Data Streams, capturing claim status changes (Approved/Denied/Pending) to feed live operational dashboards.
Orchestrated end-to-end data workflows using AWS Step Functions, coordinating dependencies between EMR Spark jobs, Glue Crawlers, and Redshift loads with automated retry logic.
Optimized EMR PySpark jobs by implementing partitioning by Claim_Date and Service_Type, and broadcasting lookup tables (Member/Provider), reducing job runtime by 25%.
Developed Redshift Data Marts (Gold Layer) for the finance team, designing tables with appropriate Distribution and Sort Keys to speed up queries on aggregate claim costs.
Implemented Data Quality checks in the pipeline using PySpark to flag nulls and duplicates, reducing data defects in the final Redshift reports by 25%.
Supported the DevOps culture by deploying AWS infrastructure (S3 buckets, IAM Roles, EMR Security Configs) using Terraform and managing code versions via Git/GitHub Actions.
Monitored production pipelines using Amazon CloudWatch, creating alarms for job latencies and failures, and performed root cause analysis (RCA) to resolve P1/P2 data incidents.
Assisted in cataloging data assets using AWS Glue Data Catalog, ensuring all tables in the Data Lake were discoverable for downstream analytics users in Athena.
Collaborated on cost-optimization tasks, configuring EMR Managed Scaling policies to scale down worker nodes during off-peak hours, saving 15% on compute costs.
Enforced security standards by applying IAM policies and S3 Bucket Policies, ensuring strict access controls for PII/PHI data in compliance with HIPAA regulations.
Tech Stack: Python, SQL, AWS EMR (PySpark), S3, Kinesis, Redshift, Athena, Glue, Step Functions, Terraform, Docker, CloudWatch, IAM, Lake Formation, Tableau, Git.
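Illustrative sketch (hypothetical paths and column names): the shape of an EMR PySpark step from this role, broadcasting the small Member/Provider lookups against the large claims fact and writing the Silver layer partitioned by claim date and service type.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_batch_transform").getOrCreate()

claims = spark.read.parquet("s3://claims-bronze/daily/")          # large fact, 20M+ rows per batch
members = spark.read.parquet("s3://claims-refdata/members/")      # small lookup table
providers = spark.read.parquet("s3://claims-refdata/providers/")  # small lookup table

enriched = (
    claims
    # Broadcasting the small dimensions avoids shuffling the large claims table.
    .join(F.broadcast(members), "member_id", "left")
    .join(F.broadcast(providers), "provider_npi", "left")
    # Derive the "allowed vs. billed" delta used in financial reporting.
    .withColumn("allowed_vs_billed", F.col("allowed_amount") - F.col("billed_amount"))
)

(
    enriched.write.mode("overwrite")
    # Partitioning lets downstream queries prune by date and service type.
    .partitionBy("claim_date", "service_type")
    .parquet("s3://claims-silver/claims_enriched/")
)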
Excers.com | Data Engineer | Hyderabad, India | Mar 2020 – Oct 2021
Project: Retail Analytics & Inventory Migration (Hadoop to AWS)
Developed and maintained Informatica mappings to migrate daily Point-of-Sale (POS) and inventory data (~80 GB/day) from on-prem Hive tables to Amazon Redshift, ensuring 99.9% data consistency across retail stores.
Assisted in the management of AWS infrastructure by configuring IAM Roles, S3 Bucket Policies, and Security Groups to ensure secure, least-privilege access for data pipelines and analytics users.
Built PySpark scripts to process unstructured clickstream data, converting raw CSV and JSON files into partitioned Parquet format in S3 to reduce storage costs and optimize query performance.
Implemented incremental load logic using Redshift staging tables and Informatica task flows to synchronize daily stock updates, reducing the overall ETL processing window by 2 hours.
Configured SSE-KMS encryption for S3 data at rest and automated bad-data handling by routing failed records to a quarantine S3 bucket for manual review and reconciliation.
Monitored Redshift cluster health and performance, performing routine VACUUM and ANALYZE operations and collaborating with seniors to apply Sort and Distribution keys.
Scheduled and orchestrated multi-stage data dependencies in Apache Airflow (DAGs), utilizing parameterized variables to handle automated retries and backfills for sales events.
Created CloudWatch alarms to notify the team of pipeline failures and built Grafana dashboards to track daily load completion times and data freshness metrics.
Utilized CloudFormation templates to assist in the deployment of S3 buckets and lifecycle policies, ensuring automated archival of historical retail data to S3 Glacier.
Maintained Source-to-Target Mapping (STTM) documentation and technical runbooks to facilitate smooth knowledge transfer to the production support team.
Tech Stack: Python, SQL, Informatica, Apache Spark (PySpark), Hive, AWS (S3, Redshift, CloudWatch, IAM, CloudFormation), Airflow, Grafana, Bitbucket
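Illustrative sketch (hypothetical buckets and validation rules): the bad-data quarantine flow from this role, sending valid POS records to the partitioned Parquet layer and parking failed records for manual review and reconciliation.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pos_quarantine").getOrCreate()

# Daily POS dump landed by the ingestion job; path is hypothetical.
raw = spark.read.option("header", "true").csv("s3://retail-raw/pos/2021-06-01/")

# Null-safe validity flag: a record needs a store, a SKU, and a non-negative
# sale amount; any null comparison collapses to False (invalid).
is_valid = F.coalesce(
    F.col("store_id").isNotNull()
    & F.col("sku").isNotNull()
    & (F.col("sale_amount").cast("double") >= 0),
    F.lit(False),
)

valid = raw.filter(is_valid)
invalid = raw.filter(~is_valid).withColumn("rejected_at", F.current_timestamp())

# Good records continue to the curated Parquet layer; failed records go to quarantine.
valid.write.mode("append").partitionBy("store_id").parquet("s3://retail-curated/pos/")
invalid.write.mode("append").json("s3://retail-quarantine/pos/")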
EDUCATION:
University of North Texas, Denton, TX — M.S. Data Science, GPA 3.9 (Aug 2023 – May 2025)
Kalasalingam University, Tamil Nadu, India — B.S. Mechanical Engineering, GPA 3.3 (Jun 2017 – Mar 2021)
CERTIFICATIONS:
AWS Certified Data Engineer – Associate
AWS Certified Machine Learning Engineer – Associate
AWS Certified Cloud Practitioner
SnowPro Core Certified
Databricks Generative AI Fundamentals
Prompt Engineering for ChatGPT by Vanderbilt University (Coursera)
TECHNICAL SKILLS:
Programming & Scripting: Python, SQL, Java, Scala, Bash/Shell, Linux
Data Engineering & ETL: Apache Airflow, AWS Glue, Azure Data Factory (ADF), dbt, Informatica, Talend, Apache NiFi, SSIS, Fivetran
Cloud Platforms & Storage: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Data Warehousing & Lakes: Snowflake, Amazon Redshift, Google BigQuery, Hadoop (HDFS)
Big Data & Stream Processing: Apache Spark (PySpark), Apache Hadoop, Azure Databricks, Apache Kafka, Apache Flume, Google Pub/Sub
Databases (SQL & NoSQL): Oracle, MySQL, PostgreSQL, MongoDB, Cassandra, Spark SQL
BI & Visualization: Tableau, Power BI, SSRS
Monitoring & Observability: Grafana, CloudWatch, ELK Stack, Dynatrace, Quantum Metric
DevOps & CI/CD: Git, GitHub, Bitbucket, Jenkins, Docker, Kubernetes, Terraform, AWS CloudFormation
Data Governance & Quality: dbt Tests, AWS Deequ, Glue Data Quality, RBAC, Masking Policies
MLOps & Feature Engineering: Snowpark, SageMaker, MLflow, Feature Store Concepts, Model Monitoring
IDEs & Notebooks: Jupyter Notebook, IntelliJ IDEA, Visual Studio Code, Google Colab
Machine Learning & AI: Scikit-learn, spaCy, NLTK, Basic Deep Learning, LLM Concepts, Prompt Engineering