Masum Baba Mohmmad
+1-682-***-**** *******************@*****.*** Denton, TX LinkedIn
PROFESSIONAL SUMMARY
Data Engineer with 4+ years of experience designing, developing, and optimizing end-to-end data pipelines, cloud-native data architectures, and streaming solutions across Azure, AWS, and GCP. Proficient in ETL/ELT development, big data processing, workflow automation, and data quality frameworks that support analytics and reporting. Accomplished in PySpark, Delta Lake, Hadoop, Apache Kafka, Databricks, Redshift, BigQuery, and Power BI/Looker Studio, with a strong focus on performance optimization, cost-efficient cloud solutions, and high data reliability. Demonstrated expertise in building scalable data lakes, implementing real-time ingestion pipelines, orchestrating workflows, and delivering actionable insights through automated dashboards and curated data marts.
TECHNICAL SKILLS
Programming & Data Engineering: Python (PySpark), SQL, Hadoop MapReduce, Spark Streaming, ETL/ELT Development
Cloud Platforms & Services: Azure (Data Factory, Databricks, Data Lake Storage Gen2, Synapse Analytics, Event Hubs, ML Services), AWS (Glue, S3, Redshift, Lambda, Step Functions, DataBrew, SageMaker), GCP (BigQuery, Cloud Storage, Pub/Sub, Dataproc, BigQuery ML)
Data Storage & Processing: Delta Lake, Data Lakehouse Architecture, HDFS, Partitioning, Schema Evolution, OLAP/OLTP Data Marts
Streaming & Messaging: Apache Kafka, Spark Streaming, Azure Event Hubs, GCP Pub/Sub
Data Quality, Governance & Analytics: Apache Atlas, Hive Metastore, Data Validation, Data Lineage, Power BI, Amazon QuickSight, Looker Studio
Machine Learning / AI Support: ML-ready feature engineering, ML pipeline preparation, Databricks MLflow, Azure Machine Learning
Performance Optimization & Orchestration: Partition Pruning, Broadcast Joins, Caching Strategies, YARN/Dataproc Autoscaling, Workflow Orchestration
WORK EXPERIENCE
Data Engineer Piper Sandler USA Mar 2025 – Present
Engineered scalable ETL and ELT pipelines using Azure Data Factory, Azure Databricks (PySpark), and Delta Lake to ingest and transform 15M+ daily trade and client portfolio records from diverse on-prem and cloud sources.
Designed and implemented a cloud-native data lakehouse architecture on Azure Data Lake Storage Gen2 with Delta format, partitioned storage, and schema evolution to ensure efficient storage management and ACID compliance.
Integrated real-time streaming frameworks with Azure Event Hubs, Kafka, and Synapse Analytics to support low-latency market data ingestion and near real-time analytics for trade risk monitoring and price anomaly detection.
Developed automated data quality and validation pipelines using PySpark, Great Expectations, and Azure Data Factory to enforce data governance, improving accuracy of downstream AI models by 25%.
Built ML-support pipelines for feature preparation, model scoring integration, and automated retraining orchestration using Azure Machine Learning and Databricks MLflow, enabling data engineers and data scientists to deploy predictive analytics efficiently.
Optimized Spark workloads and query performance through partition pruning, broadcast joins, and caching strategies, reducing ETL execution time by 40% and improving resource utilization in Azure Databricks clusters.
Built curated data marts and analytical models in Azure Synapse and Power BI to provide cross-asset risk exposure insights to investment teams, enabling data-driven decisions and reducing manual reporting by 50%.
Data Engineer Code Giants India Jan 2020 – Jul 2023
Developed and optimized batch ETL pipelines using AWS Glue, PySpark, and Hadoop MapReduce, processing 10M+ records daily and reducing data processing errors by 30%.
Built cloud-based data lakes on AWS S3, Redshift, GCP BigQuery, and Cloud Storage, improving data accessibility and query performance for analytics teams by 35%.
Implemented streaming ingestion pipelines with Apache Kafka, Spark Streaming, and GCP Pub/Sub, enabling near real-time operational analytics and faster reporting for business teams.
Automated ETL workflow orchestration using AWS Step Functions, Lambda, and GCP Dataproc, cutting pipeline runtime by 40% and minimizing manual monitoring effort.
Established robust data quality and governance frameworks with AWS Glue DataBrew, Apache Atlas, and Hive Metastore, improving dataset accuracy and consistency by 25%.
Optimized big data workloads via partitioning, broadcast joins, and cluster tuning (YARN, Dataproc autoscaling), reducing processing costs and job completion times by 25–35%.
Designed interactive dashboards and automated reporting pipelines using Amazon QuickSight, Looker Studio, and BigQuery BI Engine, delivering actionable insights and reducing manual reporting by 50%.
Monitored and maintained end-to-end data pipelines, implementing logging, error handling, and alerts, ensuring 99% uptime and reliability for operational reporting.
EDUCATION
Master of Science in Information Systems and Technology University of North Texas Denton, Texas, USA May 2025
B. Tech in Electronics and Communication Engineering VNR Vignana Jyothi Institute of Engineering and Technology Hyderabad, India Mar 2021
CERTIFICATIONS
Machine Learning 401: Zero to Mastery Machine Learning (Udemy)
Python – Introduction to Data Science and Machine Learning A-Z (Udemy)
Hands-On Tableau for Data Science (Udemy)