Shanmukh Sai Madhu
Data Engineer
Chicago, IL, USA +1-501-***-**** ***********@*****.*** LinkedIn GitHub
SUMMARY
• Data Engineer with 5 years of experience across finance, AI, and public sector projects, specializing in real-time pipelines, ETL orchestration, and cloud analytics on AWS and Azure.
• Build and optimize pipelines using ADF, Databricks, PySpark, Kafka, Airflow, and dbt, processing 25 million+ interactions and high-volume transactional data.
• Manage Snowflake and Synapse environments with strong skills in schema design, SCD patterns, partitioning, and performance tuning for large-scale dashboards.
• Improve governance and reliability using Great Expectations, Purview, and data lineage workflows, supporting 12+ audits and reducing reconciliation effort.
• Develop analytics assets with Power BI, Tableau, Pandas, and NumPy, driving insights for operations, reporting, and community-focused initiatives.
WORK EXPERIENCE
JP Morgan Chase & Co., Chicago, IL February 2025 – Present
Data Engineer
• Built real-time streaming pipelines using Kafka (MSK), AWS S3, and EMR, processing 25 million+ financial transactions daily to modernize fraud monitoring across lines of business.
• Orchestrated batch ingestion pipelines via Azure Data Factory and Apache Airflow, aligning historical refresh cycles with real-time streams for consistent analytics.
• Refactored legacy PySpark processes into dbt models on Databricks, reducing pipeline failures and cutting daily data refresh time from 2 hours to under 30 minutes.
• Designed Snowflake and Redshift data models using star schema and data vault patterns, delivering 200+ standardized KPIs across fraud, treasury, and wealth use cases.
• Embedded data quality and lineage validation with Great Expectations, Lake Formation, and Azure Purview, supporting 12+ successful audit reviews with zero discrepancies.
• Streamlined CI/CD releases using Azure DevOps with containerized Spark runtimes, reducing deployment cycles from 5 hours to under 1 hour across Dev, UAT, and Prod.
• Developed AI agents using Snowflake Cortex to automate fraud-risk scoring, validate KPIs, and summarize anomalies, reducing analyst review time by 35%.
• Automated multi-cloud infrastructure provisioning via Terraform, eliminating environment drift across AWS and Azure resources.
Florida Data Science for Social Good, Jacksonville, FL June 2024 – August 2024
Data Science Intern
• Integrated 67 county-level census datasets into a Databricks Lakehouse, using Unity Catalog for governance and collaborating with civic partners on standardized data access.
• Standardized and deduplicated 5M+ population records with PySpark, reducing data preparation workload by 120 staff hours per refresh cycle.
• Supported predictive model development using Python, Pandas, and scikit-learn to identify vulnerable communities and guide public resource allocation.
• Built Tableau dashboards and clustering models to surface 150+ undercount-prone neighborhoods, informing targeted outreach strategies for nonprofit agencies.
University of South Dakota, Vermillion, SD August 2023 – May 2024 Research Assistant
• Engineered machine learning models for breast cancer and pneumonia detection using TensorFlow, PyTorch, and scikit-learn, improving classification accuracy on 12k+ medical images.
• Processed and transformed large-scale imaging datasets in Python, applying feature extraction and augmentation techniques, shortening model training cycles by 72 hours.
• Integrated Hugging Face transformers and LLM methods to combine imaging and text data, improving multi-modal analysis across 5 clinical studies.
• Built statistical analysis and visualization pipelines with Seaborn and Matplotlib, producing 15 research reports and mentoring students on reproducible ML workflows.
• Implemented a medical RAG workflow that retrieved clinical notes and literature to contextualize model predictions, improving interpretability and supporting research presentations.
Purecode Software Inc, India November 2022 – July 2023
Software Engineer / Data Engineer
• Streamlined ingestion workflows with Azure Data Factory and Databricks, collaborating with ML engineers to prepare 5TB+ AI training datasets, eliminating 200+ hours of manual intervention per quarter.
• Processed interaction data in ADLS Gen2 using PySpark joins, filters, and time-window aggregations, partnering with product managers to enable personalization features for 2M+ active users.
• Orchestrated data and AI pipelines in Apache Airflow with dependencies, retries, SLAs, and alerts to maintain reliable production runs.
• Built a centralized Snowflake warehouse with SQL models, enabling 30+ features and reducing reporting time by 3 hours.
• Designed AI workflows by integrating model inference, vector embeddings, and semantic search into production pipelines to support recommendation and retrieval-based use cases.
• Deployed containerized pipelines on AKS with Azure DevOps CI/CD, coordinating with DevOps teams to streamline promotions across 4 environments, reducing deployment time from 6 hours to under 1 hour.
iMerit Technology Services Pvt. Ltd., India November 2020 – November 2022
ETL Developer / ITES
• Built ADF orchestration flows linking S3 and ADLS Gen2 to load 10TB+ into Synapse Analytics, reducing prep time by 15 hours per cycle.
• Developed PySpark pipelines scheduled through ADF, handling 500K+ NLP and image-classification records per day across S3 and ADLS.
• Containerized ETL pipelines with Docker and deployed to AKS via Azure DevOps, reducing rollback issues by 12 per quarter.
• Built Power BI dashboards visualizing 20+ SLA metrics and KPIs, helping project managers accelerate issue detection by 2 days per release.
• Standardized metadata, schema evolution, versioning, and ETL best practices to support audit-ready workflows across 8 reviews.
Dell Technologies, India August 2019 – August 2020
Data Analyst Intern
• Built ADF pipelines to extract and load 100K+ records from SQL Server and Oracle, improving data availability for reporting.
• Developed SQL and Python scripts for data cleaning and validation, increasing accuracy of recurring operational reports.
• Created Power BI dashboards with 10+ KPIs, helping sales and product teams track performance trends.
SKILLS
• Programming & Scripting: Python, SQL, R, JavaScript, Bash, Shell Scripting
• Data Engineering & Big Data: Apache Spark (PySpark), Airflow, Kafka, dbt, Delta Lake, ETL/ELT, Schema Evolution, Slowly Changing Dimensions (SCD), Partitioning, Performance Tuning, Data Lineage, Great Expectations
• Cloud & Data Platforms: Azure (Data Factory, Databricks, Synapse, ADLS, Azure SQL, Functions, Event Hub, AKS, Purview, Key Vault, DevOps), AWS (S3, EMR, Glue Catalog, MSK, Lambda, Redshift), Snowflake, GCP
• AI Tools & Frameworks: Hugging Face Transformers, TensorFlow, PyTorch, Scikit-learn, LangChain, Vector Databases (FAISS, Chroma), OpenAI API, RAG Pipelines, Prompt Engineering, Model Serving on AKS/SageMaker
• Databases & Storage: SQL Server, T-SQL, Oracle, PostgreSQL, MySQL, MongoDB, Parquet, ORC, Avro, JSON, CSV
• Visualization & Analysis: Power BI, Tableau, QuickSight, Pandas, NumPy
• DevOps & Tools: GitHub, Docker, Jenkins, Linux CLI, Agile (Scrum, Kanban), Jira, Infrastructure as Code (Terraform)
EDUCATION
University of South Dakota, Vermillion, SD August 2023 – December 2024
Master of Science in Computer Science
Jawaharlal Nehru Technological University, India June 2016 – December 2020
Bachelor of Technology in Computer Science and Engineering
PROJECTS
Data Pipeline Implementation for Retail Analytics
• Built an end-to-end data pipeline using ADF, ADLS, and Synapse to process sample retail sales data and generate Power BI dashboards.
Text Mining on Electronic Health Records for Cancer Care
• Developed Python-based NLP scripts to clean and analyze open-source EHR text data, applying Random Forest and SVM models to explore cancer-related term patterns.
Crypto Tracker Link
• Created a React web app using Material UI and the CoinGecko API to display real-time cryptocurrency prices and trends.
CERTIFICATIONS
• SnowPro Associate: Platform Certification (Verify)
• HackerRank SQL Advanced Certificate (Verify)