Diwakar Gupta
Data Engineer
Denton, TX ****************@*****.*** 940-***-**** LinkedIn
SUMMARY
Data Engineer with 5+ years of experience designing and deploying scalable data pipelines, real-time streaming systems, and cloud-based analytics solutions across financial services and enterprise environments. Skilled in Apache Kafka, Spark Structured Streaming, Snowflake, AWS (S3, Redshift, Glue, SageMaker), Azure Data Factory, dbt, Terraform, and orchestration and CI/CD tools such as Airflow, Jenkins, and GitLab. Proven track record of reducing data latency by over 90%, cutting infrastructure costs by 30%, and improving deployment reliability by 45%. Experienced in automating ML pipelines and partnering with Data Science, Trading, and Risk teams to deliver high-impact, production-ready solutions that drive decision-making and operational efficiency.
EXPERIENCE
JPMorgan Chase NY Data Engineer September 2023 – Present
• Built real-time data streaming pipelines using Apache Kafka and Spark Structured Streaming, reducing trade settlement latency from 15 minutes to under 500ms to enhance real-time analytics and trading decisions.
• Optimized complex SQL queries and materialized views in Snowflake, improving trading analytics performance and reducing report generation time from 10 minutes to under 30 seconds for high-frequency trading teams.
• Designed and deployed scalable cloud-based data pipelines on AWS, utilizing Amazon S3, Redshift, and AWS Glue, achieving a 30% reduction in infrastructure costs and improving data accessibility across cross-functional teams.
• Developed machine learning feature pipelines integrating AWS SageMaker, Databricks, and Python (Pandas, Scikit-learn) to automate feature engineering and model training, improving risk model forecast accuracy by 20%.
• Automated release and deployment processes with Airflow, Jenkins, and GitHub Actions, and enforced environment consistency through Infrastructure-as-Code (IaC) with Terraform, decreasing deployment errors by 45%.
• Integrated dbt (Data Build Tool) with retrieval-augmented generation (RAG) documentation workflows to automate data lineage tracking and metadata management, increasing data discoverability by 40% and cutting compliance effort by 25%.
• Collaborated with Data Science, Trading, and Risk Management teams to deliver low-latency pipelines, boosting ML model integration and improving trading strategy performance by 18%.
Cognizant India Data Engineer May 2019 – July 2022
• Engineered a scalable big data processing solution using Hadoop ecosystem tools (HDFS, Hive, MapReduce) to process and analyze over 10TB of data, reducing batch job runtimes by 40% and lowering storage costs.
• Designed and maintained 100+ scalable ETL pipelines using Python, PySpark, and Azure Data Factory; automated task orchestration and SLA monitoring with Apache Airflow, increasing pipeline reliability by 25%.
• Developed CI/CD pipelines using Jenkins and GitLab CI/CD, automating ETL workflow deployments across development, QA, and production environments, reducing release cycle time by 50%.
• Constructed end-to-end ML pipelines using PySpark MLlib and Scikit-learn, automating key stages of the model lifecycle, improving prediction accuracy by 22%, and reducing manual intervention by 40%.
• Created dynamic Power BI reporting tools with high-efficiency DAX expressions and optimized SQL queries, empowering senior leadership with real-time business intelligence and cutting support requests by 30%.
• Contributed to the delivery of 3+ enterprise data migration projects for Fortune 500 clients, collaborating with cross-functional teams of 5+ to transition legacy systems to modern cloud-based architectures, resulting in a 35% reduction in operational costs.
SKILLS
Programming & Scripting: Python, Java, Scala, SQL, Shell Scripting
Big Data & Streaming Technologies: Apache Spark, Apache Flink, Apache Kafka, Apache Hive, Hadoop, Apache NiFi, MapReduce
ETL, Orchestration & Data Integration: PySpark, Apache Airflow, Apache Beam, dbt (Data Build Tool), SSIS, Informatica PowerCenter
Cloud Platforms & Services: AWS (EC2, S3, Lambda, Glue, RDS, IAM), Azure (Data Lake Storage, Databricks, Synapse Analytics)
Machine Learning & AI: SageMaker, Scikit-learn, PySpark MLlib, TensorFlow, PyTorch, Pandas, NumPy, Matplotlib, Feature Engineering, MLOps Pipelines, Predictive Analytics, Forecast Modeling, Retrieval-Augmented Generation (RAG), Automated Machine Learning (AutoML)
Data Warehousing & Databases: Snowflake, Redshift, Synapse Analytics, Teradata, PostgreSQL, MySQL, SQL Server, Oracle, DynamoDB
DevOps & Infrastructure as Code (IaC): Jenkins, Git, GitHub Actions, GitLab CI/CD, Docker, Kubernetes, Terraform
Data Visualization & Business Intelligence: Power BI, Tableau, Microsoft Excel
EDUCATION
Master’s in Computer Engineering University of North Texas, Denton, TX, USA.