DILEEP MUKKU
USA | +1-980-***-**** | ***********@*****.*** | LinkedIn
Summary
Data Engineer with 3+ years of experience architecting and optimizing large-scale, cloud-native data solutions on AWS and Azure. Proven expertise in designing scalable data pipelines with PySpark, Databricks, and EMR, building real-time streaming applications with Kafka and Kinesis, and managing data warehouses on Redshift and Synapse. Demonstrated success in reducing operational costs by over 30%, improving system performance up to 5x, and delivering compliant, production-ready data platforms for the healthcare and fintech domains.
Skills
• Data Engineering & ETL: PySpark, Spark SQL, Delta Lake, Databricks, AWS Glue, EMR, Azure Data Factory, Change Data Capture (CDC), dbt, ETL Frameworks, Data Partitioning, Data Modeling
• Cloud Platforms & Services: AWS (S3, Redshift, Kinesis, MSK, Lambda, IAM, Lake Formation), Azure (Synapse Analytics, Data Lake Storage (ADLS) Gen2, Databricks, Cosmos DB, Event Hubs, Functions)
• Big Data & Streaming: Apache Kafka, Spark Structured Streaming, AWS Kinesis, Stream Processing, Data Lakehouse Architecture, Optimized Distributed Joins
• Databases & Data Warehousing: Amazon Redshift, Azure Synapse, PostgreSQL, DynamoDB, Cosmos DB, SQL Query Optimization, Schema Design
• DataOps & DevOps: CI/CD (Azure DevOps), Airflow, Terraform, ARM Templates, Data Lineage (OpenLineage), Great Expectations, Pipeline Monitoring & Alerting
• Programming & Scripting: Python (Pandas, PySpark UDFs), SQL (Advanced Querying, CTEs, Window Functions), Scala, Bicep
• Data Governance & Security: HIPAA, GDPR, CCPA, PII Tagging, Column-Level Encryption, IAM Policies, Data Cataloging
• Machine Learning Engineering: MLflow, Azure Machine Learning, SageMaker, Feature Store Design, Model Serving
Experience
Data Engineer | MetLife (Remote, USA) | 06/2024 – Present
• Architected a HIPAA-compliant medallion lakehouse on Azure Databricks and Delta Lake, centralizing 50M+ electronic health records to eliminate data silos and establish a single source of truth for clinical analytics.
• Engineered a serverless real-time streaming pipeline with Azure Event Hubs and Functions, processing 500,000+ daily claim events to enable sub-minute adjudication and enhance customer satisfaction.
• Automated Change Data Capture (CDC) workflows using Azure Data Factory and Synapse Pipelines, orchestrating the hourly replication of 15TB of provider data to support accurate network management and reporting.
• Operationalized a PySpark machine learning model for predicting high-risk patient cohorts, which improved the precision of proactive care interventions by 28% and optimized resource allocation.
• Spearheaded the Infrastructure-as-Code (IaC) initiative using Bicep, automating the provisioning of 20+ data platform resources and reducing environment setup time from days to under 15 minutes.
• Implemented a comprehensive data governance framework with column-level encryption and Azure Purview, achieving HIPAA compliance for 12 critical datasets containing protected health information (PHI).
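
A minimal sketch of the column-level encryption pattern described in the last bullet, assuming a Databricks/PySpark runtime with the cryptography package; the table paths and column name are hypothetical, and key management via Azure Key Vault is omitted for brevity.

    from cryptography.fernet import Fernet
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("phi-encryption").getOrCreate()
    key = Fernet.generate_key()  # in practice, loaded from Azure Key Vault

    @udf(StringType())
    def encrypt_value(value):
        # Encrypt a single PHI value; nulls pass through untouched.
        if value is None:
            return None
        return Fernet(key).encrypt(value.encode()).decode()

    ehr = spark.read.format("delta").load("/mnt/silver/ehr")  # hypothetical path
    protected = ehr.withColumn("member_ssn", encrypt_value(col("member_ssn")))
    protected.write.format("delta").mode("overwrite").save("/mnt/gold/ehr_protected")
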
Data Engineer | Paytm (Remote, India) | 06/2021 – 12/2022
• Designed and deployed a real-time fraud detection system on AWS EMR using PySpark MLlib, analyzing 100M+ transactions to reduce false positives by 22% and significantly lower operational costs.
• Built a scalable data ingestion framework to migrate 10TB of merchant data from Hadoop to Amazon Redshift, leveraging incremental loading strategies to cut ETL costs by 30%.
• Established a streaming data platform with Amazon MSK (Kafka) and Kinesis, reducing alert latency for anti-money laundering (AML) monitoring from 10 minutes to 5 seconds for the compliance team.
• Optimized the performance and cost of the data warehouse by refining Redshift distribution keys and sort keys, slashing the runtime of critical financial reports for RBI audits by 50%.
• Championed the adoption of Apache Iceberg tables via AWS Glue to enforce ACID compliance on the data lake, reducing data corruption issues in merchant settlement processes by 90%.
• Architected a tiered storage solution with S3 Lifecycle Policies and Glacier, cutting storage costs by 45% for historical transaction data while maintaining compliance and access policies.
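
A minimal sketch of the tiered-storage lifecycle rule described in the last bullet, using boto3; the bucket name, prefix, and transition windows are illustrative assumptions rather than the production values.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="txn-history-archive",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "tier-historical-transactions",
                "Filter": {"Prefix": "transactions/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 365, "StorageClass": "GLACIER"},     # cold tier
                ],
            }]
        },
    )
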
Data Engineer Intern | Paytm (Remote, India) | 01/2021 – 06/2021
• Developed and scheduled PySpark data processing jobs on AWS EMR to cleanse and standardize 5M+ daily transaction records, improving data quality for downstream fraud analytics.
• Automated manual data validation processes by building Python scripts deployed on AWS Lambda, reducing daily verification effort from 8 hours to 45 minutes (see the sketch after this section).
• Contributed to the migration of legacy Hadoop workflows to cloud-native AWS EMR, resulting in a 25% decrease in processing time for merchant settlement reports.
• Created interactive operational dashboards in Amazon QuickSight to monitor ETL pipeline health metrics, which led to a 40% reduction in incident response and resolution time.
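
A minimal sketch of the Lambda-based validation referenced above; the event shape, required fields, and rules are hypothetical stand-ins for the actual checks.

    REQUIRED_FIELDS = {"txn_id", "merchant_id", "amount"}

    def handler(event, context):
        """Validate a batch of transaction records and report failures."""
        records = event.get("records", [])  # hypothetical event shape
        failures = []
        for record in records:
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                failures.append({"txn_id": record.get("txn_id"), "missing": sorted(missing)})
            elif record["amount"] < 0:
                failures.append({"txn_id": record["txn_id"], "reason": "negative amount"})
        return {"validated": len(records), "failures": failures}
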
Education
Master of Science: Computer Science
Kennesaw State University
05/2024
Bachelor of Technology: Computer Science and Engineering
K L University
05/2022
Projects
Healthcare NLP Data Pipeline
• Designed an ETL pipeline to ingest and standardize 10M+ unstructured patient addresses using NLP (spaCy) and regex, improving data consistency by 40% (see the sketch after this project).
• Built a scalable cleansing framework with Python (NLTK, Pandas) to handle missing and duplicate records, reducing manual cleanup by 30%.
• Optimized MySQL schema and indexing for address data retrieval, cutting query times by 25% for reporting.
• Automated data validation rules to flag inconsistencies in real time, reducing EHR integration errors.
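
A minimal sketch of the regex-plus-spaCy standardization from the first bullet, assuming the en_core_web_sm model is installed; the abbreviation map and sample input are a small illustrative subset of the rule set.

    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Illustrative subset of the abbreviation-expansion rules.
    ABBREVIATIONS = {r"\bst\b\.?": "Street", r"\bapt\b\.?": "Apartment", r"\brd\b\.?": "Road"}

    def standardize(address):
        """Collapse whitespace, expand abbreviations, and title-case place names."""
        text = re.sub(r"\s+", " ", address.strip())
        for pattern, replacement in ABBREVIATIONS.items():
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        # Title-case tokens spaCy tags as geopolitical entities (cities, states).
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ == "GPE":
                text = text.replace(ent.text, ent.text.title())
        return text

    print(standardize("123  Main st., New York"))

Real-Time Fraud Detection Pipeline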
• Developed a streaming pipeline (Kafka + Spark Structured Streaming) to process 5M+ daily transactions with <1s latency.
• Engineered a feature store in Delta Lake to serve real-time inputs for ML models (Isolation Forest, SVM), achieving 90% fraud recall.
• Implemented anomaly detection at scale using PySpark UDFs, reducing false positives by 20% vs. legacy systems (see the sketch after this project).
• Created Tableau dashboards with pipeline health metrics (throughput, drift) for operational monitoring.
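
A minimal sketch of the UDF-based anomaly scoring from the third bullet; the per-batch z-score is a simplified stand-in for the Isolation Forest scoring, and the Delta paths and threshold are hypothetical.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf

    spark = SparkSession.builder.appName("anomaly-scoring").getOrCreate()

    @pandas_udf("double")
    def anomaly_score(amount: pd.Series) -> pd.Series:
        # Per-batch z-score; a real deployment would call the trained model here.
        return (amount - amount.mean()) / amount.std()

    features = spark.read.format("delta").load("/mnt/features/transactions")  # hypothetical
    flagged = (features
               .withColumn("score", anomaly_score(col("amount")))
               .where("abs(score) > 3.0"))  # illustrative threshold
    flagged.write.format("delta").mode("append").save("/mnt/alerts/anomalies")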