NAGABHAIRU VINOD KUMAR
St Louis, USA +1-774-***-**** ******************@*****.*** LinkedIn
SUMMARY
Data Engineer with 4+ years of experience across healthcare, marketing analytics, and consulting environments, having transitioned from data analysis into building scalable data pipelines and distributed processing workflows with Python, SQL, Spark, and PySpark. Experienced with Databricks, Apache Airflow, Kafka, dbt, and AWS for building batch and streaming pipelines that support analytics, machine learning, and operational reporting. Collaborates with data science, analytics, and engineering teams to deliver reliable production datasets.
PROFESSIONAL EXPERIENCE
DATA ENGINEER Jul 2025 – Present
McKesson USA
● Architected enterprise Python and SQL data pipelines that integrated pharmaceutical distribution and inventory records into Delta Lake on Databricks, supporting analytics across 8 healthcare operations teams.
● Processed 120+ million healthcare supply chain transaction records using Apache Spark and PySpark, generating operational datasets used for medication distribution monitoring and inventory planning.
● Established real-time logistics ingestion streams with Apache Kafka, capturing shipment and warehouse updates across 18 event topics delivered into AWS S3, supporting supply chain visibility analytics.
● Directed enterprise workflow scheduling through Apache Airflow, coordinating 30+ production pipeline runs that handled pharmaceutical distribution data across Databricks and downstream operational reporting layers.
● Generated 28 analytics tables using dbt and SQL, enabling operations teams to monitor nationwide medication inventory turnover and distribution efficiency.
● Prepared machine learning training datasets with Python, Databricks, and MLflow, organizing large healthcare logistics records consumed by data science teams building demand forecasting models.
● Containerized Spark processing services using Docker and deployed workloads on Kubernetes, stabilizing daily execution of high-volume pharmaceutical transaction pipelines across production data environments.
DATA ENGINEER Nov 2024 – Jun 2025
Epsilon USA
● Built Python and SQL ingestion pipelines that captured customer interaction data from advertising platforms into AWS S3, organizing 90+ campaign datasets for marketing analytics teams.
● Transformed raw healthcare supply chain data into structured datasets with PySpark, enabling real-time monitoring of medication distribution and enhancing inventory demand planning for 10M+ transactions.
● Established Apache Kafka streaming pipelines that captured real-time customer activity and ingested 12 event topics into S3, supporting near-real-time campaign analytics workflows.
● Automated enterprise data workflows via Apache Airflow, coordinating 25 automated pipeline runs that processed advertising and customer engagement data across AWS Glue and analytics reporting layers.
● Produced transformation models using dbt and SQL, generating 30+ curated analytics tables in Snowflake, enabling marketing analysts to build audience segmentation and campaign reporting dashboards.
● Enhanced batch pipeline reliability by tuning PySpark workloads and AWS Glue jobs handling high-volume marketing datasets, reducing refresh delays across daily campaign analytics pipelines.
DATA ANALYST Jun 2020 – Aug 2022
KPMG INDIA
● Analyzed financial data from SAP ERP using Excel and Python (Pandas), delivering structured datasets used by 4 business teams and standardizing reporting across monthly and quarterly cycles.
● Validated and reconciled large transaction datasets (1M+ records) using SQL, identifying inconsistencies and minimizing recurring data errors across validation cycles.
● Developed ETL workflows using Informatica and SQL Server, standardizing financial data and generating 20+ reporting tables, reducing manual data preparation for regulatory submissions.
● Built reporting datasets in Snowflake supporting Tableau dashboards and KPI tracking, reducing analysis turnaround time by enabling same-day access to audit metrics during compliance reviews.
● Optimized reconciliation queries in PostgreSQL, reducing execution time by 2–3 minutes per query, accelerating analysis across high-volume financial datasets.
● Integrated financial data from 7 source systems including SAP using Python, reducing manual consolidation effort and improving consistency for audit investigations.
● Delivered analysis-ready datasets and reports to stakeholders, supporting faster identification of financial discrepancies and enabling timely audit decisions during review cycles.
TECHNICAL SKILLS
● Programming Languages: Python, SQL, Scala, Java
● Data Processing & Big Data: Apache Spark, PySpark, Apache Kafka, Hadoop
● Data Engineering & Pipelines: ETL Pipelines, Data Ingestion, Data Transformation, Batch Processing, Streaming Data Pipelines
● Workflow Orchestration: Apache Airflow, Prefect, Dagster
● Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery, Databricks
● Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
● AWS Data Services: Amazon S3, AWS Glue, Amazon Redshift, AWS Lambda, Amazon EMR, AWS Athena
● Databases: PostgreSQL, MySQL, SQL Server, MongoDB
● Data Modeling: Dimensional Modeling, Star Schema, Snowflake Schema, Data Architecture
● Machine Learning Data Support: Feature Engineering Pipelines, ML Data Pipelines, MLflow
● DevOps & Engineering Practices: Git, Docker, Kubernetes, CI/CD, Terraform
● Data Analysis Libraries: Pandas, NumPy
● Tools & Collaboration: Jira, Agile Scrum
EDUCATION
Master of Science in Data Analytics May 2024
Clark University Worcester, MA
Master of Technology in Software Engineering May 2022
Vellore Institute of Technology Amaravati, India
PROJECTS
Object Detection System using Raspberry Pi
Python, OpenCV, Raspberry Pi, MobileNet-SSD / YOLO
● Designed a real-time object detection pipeline using Python, OpenCV, and a Raspberry Pi camera module, analyzing continuous video frames through a MobileNet-SSD model for visual object recognition.
● Executed frame-processing workflows that analyzed 30 FPS video streams, extracting visual features and producing object detections with bounding box coordinates.
● Rendered detected object labels and bounding boxes on a connected display, converting raw camera feeds into structured visual outputs for real-time monitoring.
Real-Time Data Pipeline for E-Commerce Analytics
Python, Apache Kafka, PySpark, Apache Airflow, AWS S3, Snowflake
● Constructed a real-time data ingestion pipeline using Apache Kafka and Python, capturing simulated e-commerce clickstream events and streaming 50K+ user activity records daily into AWS S3.
● Transformed streaming event data through PySpark processing jobs that cleaned and aggregated customer interaction logs before loading structured datasets into Snowflake analytics tables.
● Scheduled batch transformation workflows using Apache Airflow, orchestrating 12 automated pipeline tasks, delivering daily sales, product engagement, and customer behavior metrics.
CERTIFICATIONS
● IBM Data Science Professional Certificate — Coursera
● Python for Everybody Specialization — Coursera