DILEEP MUKKU
USA | +1-980-***-**** | ***********@*****.*** | LinkedIn
Summary
Data Engineer with 3+ years of experience architecting and optimizing large-scale, cloud-native data solutions on AWS and Azure. Proven expertise in designing scalable data pipelines with PySpark, Databricks, and EMR, building real-time streaming applications with Kafka and Kinesis, and managing data warehouses on Redshift and Synapse. Demonstrated success in reducing operational costs by over 30%, improving system performance up to 5x, and delivering compliant, production-ready data platforms for the healthcare and fintech domains.
Skills
• Data Engineering & ETL: PySpark, Spark SQL, Delta Lake, Databricks, AWS Glue, EMR, Azure Data Factory, Change Data Capture (CDC), dbt, ETL Frameworks, Data Partitioning, Data Modeling
• Cloud Platforms & Services: AWS (S3, Redshift, Kinesis, MSK, Lambda, IAM, Lake Formation), Azure (Synapse Analytics, Data Lake Storage (ADLS) Gen2, Databricks, Cosmos DB, Event Hubs, Functions)
• Big Data & Streaming: Apache Kafka, Spark Structured Streaming, AWS Kinesis, Stream Processing, Data Lakehouse Architecture, Optimized Distributed Joins
• Databases & Data Warehousing: Amazon Redshift, Azure Synapse, PostgreSQL, DynamoDB, Cosmos DB, SQL Query Optimization, Schema Design
• DataOps & DevOps: CI/CD (Azure DevOps), Airflow, Terraform, ARM Templates, Data Lineage (OpenLineage), Great Expectations, Pipeline Monitoring & Alerting
• Programming & Scripting: Python (Pandas, PySpark UDFs), SQL (Advanced Querying, CTEs, Window Functions), Scala, Bicep
• Data Governance & Security: HIPAA, GDPR, CCPA, PII Tagging, Column-Level Encryption, IAM Policies, Data Cataloging
• Machine Learning Engineering: MLflow, Azure Machine Learning, SageMaker, Feature Store Design, Model Serving
Experience
Data Engineer | MetLife (Remote, USA) | 06/2024 – Present
• Architected a HIPAA-compliant medallion lakehouse on Azure Databricks and Delta Lake, centralizing 50M+ electronic health records to eliminate data silos and establish a single source of truth for clinical analytics.
• Engineered a serverless real-time streaming pipeline with Azure Event Hubs and Functions, processing 500,000+ daily claim events to enable sub-minute adjudication and enhance customer satisfaction.
• Automated Change Data Capture (CDC) workflows using Azure Data Factory and Synapse Pipelines, orchestrating the hourly replication of 15TB of provider data to support accurate network management and reporting.
• Operationalized a PySpark machine learning model for predicting high-risk patient cohorts, which improved the precision of proactive care interventions by 28% and optimized resource allocation.
• Spearheaded the Infrastructure-as-Code (IaC) initiative using Bicep, automating the provisioning of 20+ data platform resources and reducing environment setup time from days to under 15 minutes.
• Implemented a comprehensive data governance framework with column-level encryption and Azure Purview, achieving HIPAA compliance for 12 critical datasets containing protected health information (PHI).
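
A minimal sketch of the column-level encryption pattern described in the last bullet, assuming a Databricks/PySpark runtime with the cryptography package; the table paths and column name are hypothetical, and key management via Azure Key Vault is omitted for brevity.

    from cryptography.fernet import Fernet
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("phi-encryption").getOrCreate()
    key = Fernet.generate_key()  # in practice, loaded from Azure Key Vault

    @udf(StringType())
    def encrypt_value(value):
        # Encrypt a single PHI value; nulls pass through untouched.
        if value is None:
            return None
        return Fernet(key).encrypt(value.encode()).decode()

    ehr = spark.read.format("delta").load("/mnt/silver/ehr")  # hypothetical path
    protected = ehr.withColumn("member_ssn", encrypt_value(col("member_ssn")))
    protected.write.format("delta").mode("overwrite").save("/mnt/gold/ehr_protected")
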
Data Engineer | Paytm (Remote, India) | 06/2021 – 12/2022
• Designed and deployed a real-time fraud detection system on AWS EMR using PySpark MLlib, analyzing 100M+ transactions to reduce false positives by 22% and significantly lower operational costs.
• Built a scalable data ingestion framework to migrate 10TB of merchant data from Hadoop to Amazon Redshift, leveraging incremental loading strategies to cut ETL costs by 30%.
• Established a streaming data platform with Amazon MSK (Kafka) and Kinesis, reducing alert latency for anti-money laundering (AML) monitoring from 10 minutes to 5 seconds for the compliance team.
• Optimized the performance and cost of the data warehouse by refining Redshift distribution keys and sort keys, slashing the runtime of critical financial reports for RBI audits by 50%.
• Championed the adoption of Apache Iceberg tables via AWS Glue to enforce ACID compliance on the data lake, reducing data corruption issues in merchant settlement processes by 90%.
• Architected a tiered storage solution with S3 Lifecycle Policies and Glacier, cutting storage costs by 45% for historical transaction data while maintaining compliance and access policies.
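
A minimal sketch of the tiered-storage lifecycle rule described in the last bullet, using boto3; the bucket name, prefix, and transition windows are illustrative assumptions rather than the production values.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="txn-history-archive",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "tier-historical-transactions",
                "Filter": {"Prefix": "transactions/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 365, "StorageClass": "GLACIER"},     # cold tier
                ],
            }]
        },
    )
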
Data Engineer Intern | Paytm (Remote, India) | 01/2021 – 06/2021
• Developed and scheduled PySpark data processing jobs on AWS EMR to cleanse and standardize 5M+ daily transaction records, improving data quality for downstream fraud analytics.
• Automated manual data validation processes by building Python scripts deployed on AWS Lambda, reducing daily verification effort from 8 hours to 45 minutes (see the sketch after this section).
• Contributed to the migration of legacy Hadoop workflows to cloud-native AWS EMR, resulting in a 25% decrease in processing time for merchant settlement reports.
• Created interactive operational dashboards in Amazon QuickSight to monitor ETL pipeline health metrics, which led to a 40% reduction in incident response and resolution time.
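
A minimal sketch of the Lambda-based validation referenced above; the event shape, required fields, and rules are hypothetical stand-ins for the actual checks.

    REQUIRED_FIELDS = {"txn_id", "merchant_id", "amount"}

    def handler(event, context):
        """Validate a batch of transaction records and report failures."""
        records = event.get("records", [])  # hypothetical event shape
        failures = []
        for record in records:
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                failures.append({"txn_id": record.get("txn_id"), "missing": sorted(missing)})
            elif record["amount"] < 0:
                failures.append({"txn_id": record["txn_id"], "reason": "negative amount"})
        return {"validated": len(records), "failures": failures}
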
Education
Master of Science: Computer Science
Kennesaw State University
05/2024
Bachelor of Technology: Computer Science and Engineering
K L University
05/2022
Projects
Healthcare NLP Data Pipeline
• Designed an ETL pipeline to ingest and standardize 10M+ unstructured patient addresses using NLP (spaCy) and regex, improving data consistency by 40% (see the sketch after this project).
• Built a scalable cleansing framework with Python (NLTK, Pandas) to handle missing and duplicate records, reducing manual cleanup by 30%.
• Optimized MySQL schema and indexing for address data retrieval, cutting query times by 25% for reporting.
• Automated data validation rules to flag inconsistencies in real time, reducing EHR integration errors.
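
A minimal sketch of the regex-plus-spaCy standardization from the first bullet, assuming the en_core_web_sm model is installed; the abbreviation map and sample input are a small illustrative subset of the rule set.

    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Illustrative subset of the abbreviation-expansion rules.
    ABBREVIATIONS = {r"\bst\b\.?": "Street", r"\bapt\b\.?": "Apartment", r"\brd\b\.?": "Road"}

    def standardize(address):
        """Collapse whitespace, expand abbreviations, and title-case place names."""
        text = re.sub(r"\s+", " ", address.strip())
        for pattern, replacement in ABBREVIATIONS.items():
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        # Title-case tokens spaCy tags as geopolitical entities (cities, states).
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ == "GPE":
                text = text.replace(ent.text, ent.text.title())
        return text

    print(standardize("123  Main st., New York"))

Real-Time Fraud Detection Pipeline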
• Developed a streaming pipeline (Kafka + Spark Structured Streaming) to process 5M+ daily transactions with <1s latency.
• Engineered a feature store in Delta Lake to serve real-time inputs for ML models (Isolation Forest, SVM), achieving 90% fraud recall.
• Implemented anomaly detection at scale using PySpark UDFs, reducing false positives by 20% vs. legacy systems (see the sketch after this project).
• Created Tableau dashboards with pipeline health metrics (throughput, drift) for operational monitoring.
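
A minimal sketch of the UDF-based anomaly scoring from the third bullet; the per-batch z-score is a simplified stand-in for the Isolation Forest scoring, and the Delta paths and threshold are hypothetical.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf

    spark = SparkSession.builder.appName("anomaly-scoring").getOrCreate()

    @pandas_udf("double")
    def anomaly_score(amount: pd.Series) -> pd.Series:
        # Per-batch z-score; a real deployment would call the trained model here.
        return (amount - amount.mean()) / amount.std()

    features = spark.read.format("delta").load("/mnt/features/transactions")  # hypothetical
    flagged = (features
               .withColumn("score", anomaly_score(col("amount")))
               .where("abs(score) > 3.0"))  # illustrative threshold
    flagged.write.format("delta").mode("append").save("/mnt/alerts/anomalies")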