Diwakar Gupta
Data Engineer
Denton, TX | ************.***@*****.*** | 940-***-**** | LinkedIn
PROFESSIONAL SUMMARY
• Data Engineer with 5+ years of experience designing, developing, and optimizing large-scale data pipelines, distributed processing systems, and cloud-based data platforms across global enterprises including Meta, BlackRock, and Honeywell. Skilled in building real-time and batch ETL workflows leveraging Python, SQL, Apache Spark, PySpark, Kafka, Airflow, and dbt to deliver reliable, high-performance data solutions.
• Proven expertise across AWS (Redshift, Kinesis, Glue), Azure (Data Lake, Delta Lake, Synapse), and Snowflake, with a strong focus on data modeling, performance optimization, and automation through CI/CD, Docker, Kubernetes, and Terraform. Experienced in feature engineering and ML data pipelines supporting production-grade machine learning systems using FBLearner Flow and PySpark MLlib. Recognized for improving data throughput, reducing latency, and enhancing analytical reliability across petabyte-scale environments.
TECHNICAL SKILLS
Programming & Scripting: Python, Java, Scala, R, Shell Scripting, SQL (T-SQL, PL/SQL), NoSQL (MongoDB, DynamoDB, Cassandra)
Big Data & Streaming Technologies: Spark, PySpark, Hadoop (Hive, HDFS, MapReduce), Kafka, Kinesis, Data Lake Architecture
ETL & Workflow Orchestration: Apache Airflow, AWS Glue, dbt Core, SSIS, Batch Processing, Pipeline Automation, Data Integration
Databases & Data Warehousing: Snowflake, Redshift, PostgreSQL, MySQL, Oracle, Data Modeling, Data Governance, Query Optimization
Cloud Platforms: AWS (Redshift, S3, Glue, Lambda, EMR, Athena, EC2), Azure (Databricks, Synapse, Data Lake Storage, Azure ML)
DevOps & Infrastructure as Code: Docker, Kubernetes (EKS, AKS), Jenkins, Git, GitHub Actions, Terraform, CI/CD Pipelines
Machine Learning & MLOps: Scikit-learn, PyTorch, TensorFlow, Pandas, NumPy, MLflow, Model Deployment, Model Monitoring, Feature Store, Natural Language Processing (NLP), LLM Pipelines
Business Intelligence & Visualization: Power BI, Tableau, SSRS, Advanced Excel (Formulas, Pivot Tables), KPI Dashboard Development
Other Core Competencies: Agile (Scrum) Methodologies, SDLC, Data Quality (Great Expectations), Performance Tuning, UDFs, Recursive CTEs
EXPERIENCE
Meta – Data Engineer
Location: California, USA | Duration: January 2025 – Present
• Designed, developed, and optimized petabyte-scale data pipelines using Python and PySpark on Meta’s XStream and Hive frameworks, improving data throughput by 20% for real-time machine learning signals.
• Built scalable feature engineering pipelines using PySpark and Meta’s Tectonic layer to process billions of records, increasing model training efficiency by 25%.
• Partnered with machine learning engineers to deploy real-time recommendation models using FBLearner Flow, building low-latency feature stores that improved personalization and boosted user engagement by 12%.
• Automated 15+ end-to-end ETL workflows by creating dynamic Airflow DAGs and optimizing Presto queries for distributed execution, reducing cross-regional data latency by 30%.
• Developed a modular data transformation layer using dbt and advanced SQL (CTEs, window functions) to standardize business logic and ensure data consistency across 10+ enterprise datasets.
BlackRock – Data Engineer
Location: New York, USA | Duration: September 2023 – December 2024
• Optimized complex SQL workloads in Snowflake, utilizing query parallelization, data partitioning, and warehouse scaling to reduce Value-at-Risk (VaR) computation time from 3 hours to 30 minutes for high-volume portfolio analytics.
• Engineered and maintained event-driven data pipelines using Apache Kafka, Apache Spark, and Apache Airflow, supporting real-time ingestion of 8M+ financial transactions daily under low-latency SLAs.
• Designed and integrated Azure Data Lake Storage Gen2 with Delta Lake and hierarchical namespace support, enabling petabyte-scale analytics while cutting storage costs by 30% through automated lifecycle policies.
• Built containerized CI/CD pipelines using Docker, Kubernetes, and Jenkins; automated cloud infrastructure provisioning via Terraform, achieving a 99.9% deployment success rate and reducing environment setup time by 65%.
• Developed robust Python-based data validation frameworks with Pandas, PySpark, and Great Expectations, automating data quality assurance and improving dataset reliability by 25% for investment analytics pipelines.
Honeywell – Data Engineer
Location: India | Duration: May 2019 – July 2022
• Tuned large-scale ETL pipelines within the Hadoop ecosystem (HDFS, Hive, MapReduce, Sqoop), enhancing processing speed by 20% through advanced query optimization techniques (joins, partitioning, bucketing, ORC formats) and orchestration via Apache Oozie.
• Designed and optimized SQL-based ETL workflows on AWS Redshift and Snowflake to process IoT and building management system data, improving data transformation efficiency by 45% and reducing query costs by 25%.
• Built real-time data streaming pipelines using Python, Apache Kafka, and AWS Kinesis to ingest and process telemetry from industrial equipment, enabling live performance monitoring and reducing incident response time by 20% through predictive alerting.
• Managed multiple concurrent data engineering projects supporting industrial automation and smart building solutions, collaborating with cross-functional teams under an Agile framework to improve project delivery timelines by 15%.
EDUCATION
Master of Computer Engineering, University of North Texas, Denton, TX, USA.