
Data Engineer Machine Learning

Location:
Overland Park, KS, 66213
Salary:
120000
Posted:
October 17, 2025


Resume:

Venkata Vaibhav

Data Engineer

Overland Park, KS *****************@*****.*** 309-***-**** LinkedIn

SUMMARY

Data Engineer with 5+ years of experience designing and delivering scalable, high-performance data pipelines and platforms across cloud (Azure, AWS) and on-premise big data environments. Proven track record of building real-time streaming solutions, optimizing ETL/ELT workflows, and deploying machine learning pipelines using tools such as Databricks, Apache Spark, Kafka, Airflow, and Snowflake. Skilled in SQL, Python, PySpark, and modern data stack technologies including dbt, Delta Lake, and CI/CD pipelines. Industry expertise spans technology, financial services, and healthcare, with a strong focus on data reliability, governance, and compliance (HIPAA, GDPR).

EXPERIENCE

Microsoft CA Data Engineer September 2024 – Present

• Designed and deployed end-to-end ETL pipelines using Azure Data Factory, Azure Synapse Analytics, and Databricks, implementing a Delta Lake medallion architecture, reducing ETL costs by 45% and improving data processing latency from 1 hour to 10 minutes.
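The medallion (bronze/silver/gold) layering named in the bullet above can be sketched in plain Python. This is a simplified illustration of the pattern, not the Databricks/Delta Lake implementation described; all field and table names are hypothetical, and a real pipeline would operate on Delta tables rather than in-memory lists.

```python
# Medallion pattern sketch: raw events land in bronze, are cleaned and
# deduplicated into silver, and aggregated into gold. Names are illustrative.

def to_silver(bronze_rows):
    """Clean raw rows: drop records with no id, keep only the latest
    version of each id (last-write-wins by timestamp)."""
    latest = {}
    for row in bronze_rows:
        if row.get("id") is None:
            continue  # reject malformed records at the bronze->silver boundary
        prev = latest.get(row["id"])
        if prev is None or row["ts"] > prev["ts"]:
            latest[row["id"]] = row
    return list(latest.values())

def to_gold(silver_rows):
    """Aggregate cleaned rows into a per-category amount summary."""
    summary = {}
    for row in silver_rows:
        summary[row["category"]] = summary.get(row["category"], 0) + row["amount"]
    return summary

bronze = [
    {"id": 1, "ts": 1, "category": "a", "amount": 10},
    {"id": 1, "ts": 2, "category": "a", "amount": 12},  # newer version of id 1
    {"id": None, "ts": 3, "category": "b", "amount": 5},  # malformed record
    {"id": 2, "ts": 1, "category": "b", "amount": 7},
]
print(to_gold(to_silver(bronze)))  # {'a': 12, 'b': 7}
```

The key property the sketch preserves is that each layer only ever reads from the layer below it, which is what makes cost and latency improvements measurable per layer.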

• Developed scalable machine learning pipelines in Python with Azure Machine Learning and MLflow, cutting time-to-production for new models by 50% and improving forecast accuracy by 18%.

• Built real-time data streaming pipelines utilizing Apache Kafka, Azure Stream Analytics, and Spark Structured Streaming to ingest and process 5TB+ daily telemetry data, enabling sub-second latency dashboards and reducing incident response time by 40%.

• Implemented dbt Core on Microsoft Fabric for modular SQL transformations, standardizing 20+ KPIs across finance and sales functions; reduced SQL code duplication by 60% through Jinja templating and CI/CD integration (GitHub Actions).
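Standardizing KPIs through one parameterized SQL definition, as the dbt/Jinja bullet above describes, can be approximated with the standard library's `string.Template` to avoid a dbt dependency. The KPI, measure, and table names here are hypothetical; dbt itself would express this as a Jinja macro.

```python
from string import Template

# One parameterized definition replaces per-team copies of the same KPI SQL.
# (dbt would use a Jinja macro; string.Template stands in here.)
KPI_SQL = Template(
    "SELECT ${dims}, SUM(${measure}) AS ${kpi}\n"
    "FROM ${source}\n"
    "GROUP BY ${dims}"
)

def render_kpi(kpi, measure, source, dims="region, month"):
    """Render the shared KPI query for a given measure and source table."""
    return KPI_SQL.substitute(kpi=kpi, measure=measure, source=source, dims=dims)

print(render_kpi("net_revenue", "amount", "fct_sales"))
```

Because every team renders the same template, a definition change lands in one place, which is how duplication reductions like the 60% figure become possible to enforce in CI.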

• Led mentorship and onboarding for 3 junior data engineers and cross-functional analysts on modern data stack tools and data engineering best practices, accelerating project delivery by 25% and improving data quality metrics by 30%.

JPMorgan Chase India Data Engineer April 2021 – July 2023

• Developed robust data processing frameworks using PySpark, SQL, and Databricks on AWS EMR, implementing Delta Lake for ACID transactions, schema enforcement, and real-time streaming using Apache Kafka, boosting data reliability by 25%.

• Designed and deployed 15+ scalable data pipelines across AWS (S3, Glue, Lambda) and on-premise Hadoop environments (HDFS, Hive, Spark), improving ETL performance, reducing data latency by 35%, and ensuring compliance with data governance standards.

• Optimized Snowflake performance through partition pruning, clustering keys, and query profiling, improving dashboard load times by 60%.
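The clustering-key optimization above boils down to DDL like the statement this helper builds. The table and column names are hypothetical; the function only assembles the SQL text, it does not connect to Snowflake.

```python
def cluster_ddl(table, keys):
    """Build a Snowflake ALTER TABLE ... CLUSTER BY statement.

    Clustering on columns that appear in dashboard filters lets Snowflake
    prune micro-partitions instead of scanning the whole table, which is
    where the dashboard load-time wins come from.
    """
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(keys)})"

print(cluster_ddl("sales.fct_orders", ["order_date", "region"]))
# ALTER TABLE sales.fct_orders CLUSTER BY (order_date, region)
```

The companion step is query profiling: checking `partitions_scanned` versus `partitions_total` in the query profile confirms whether pruning is actually happening after the keys are set.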

• Engineered CI/CD pipelines with Jenkins and GitLab CI, automating infrastructure provisioning using Terraform and Python scripts, reducing deployment time for data applications by 20%.

• Created automated ETL workflows using Apache NiFi and Informatica, integrating structured and semi-structured data sources including RDBMS (DB2, SQL Server), REST APIs, and market data feeds with embedded data quality checks and audit logging.

Abbott India Data Engineer June 2019 – April 2021

• Engineered real-time clinical data pipelines using Apache Kafka, Spark Streaming, Azure Event Hubs, and Databricks, reducing latency by 40% for time-critical health data.

• Developed scalable ETL frameworks with Python, PySpark, and Apache Airflow, processing 3+ TB of healthcare data daily and improving overall data reliability by 25%.
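Embedded data quality checks of the kind the bullets above credit for reliability gains can be sketched as a gate that partitions rows into passed and rejected, with an audit record of why each rejection happened. Field names are hypothetical, and a production pipeline would emit the audit record to a logging or metrics sink rather than returning it.

```python
def run_quality_checks(rows, required, non_negative):
    """Return (passed_rows, audit): rows failing a check are dropped and
    counted by failure type, so reliability can be tracked per batch."""
    passed = []
    audit = {"missing_field": 0, "negative_value": 0}
    for row in rows:
        if any(row.get(f) is None for f in required):
            audit["missing_field"] += 1
        elif any(row[f] < 0 for f in non_negative):
            audit["negative_value"] += 1
        else:
            passed.append(row)
    return passed, audit

rows = [
    {"patient_id": "p1", "heart_rate": 72},
    {"patient_id": None, "heart_rate": 80},  # missing identifier
    {"patient_id": "p3", "heart_rate": -5},  # impossible reading
]
passed, audit = run_quality_checks(rows, required=["patient_id"], non_negative=["heart_rate"])
print(audit)  # {'missing_field': 1, 'negative_value': 1}
```

In an Airflow setting, a check like this typically runs as its own task between extract and load, so a bad batch fails loudly instead of polluting downstream tables.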

• Modernized legacy ETL workflows by integrating Hadoop, Hive, and Apache Spark, reducing pipeline latency by 35% while ensuring compliance with HIPAA and GDPR regulations for sensitive healthcare data.
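One common building block of the HIPAA/GDPR compliance work mentioned above is pseudonymizing direct identifiers before data leaves a restricted zone. This is a minimal sketch of that one technique, not a complete de-identification strategy; the field names and salt are hypothetical, and in practice the salt would come from a secret store.

```python
import hashlib

def pseudonymize(record, pii_fields, salt="demo-salt"):
    """Replace direct identifiers with salted SHA-256 digests.

    Records stay joinable on the digest without exposing the raw value.
    The salt here is a hypothetical placeholder; load it from a secret
    store in a real deployment.
    """
    masked = dict(record)
    for f in pii_fields:
        if masked.get(f) is not None:
            masked[f] = hashlib.sha256((salt + str(masked[f])).encode()).hexdigest()
    return masked

print(pseudonymize({"patient_id": "p1", "heart_rate": 72}, ["patient_id"]))
```

Because the digest is deterministic for a given salt, the same patient maps to the same token across batches, which keeps longitudinal analysis possible after masking.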

• Designed and optimized complex SQL queries, stored procedures, and dimensional data models in Snowflake and SQL Server, accelerating report generation by 60% and enabling real-time analytics to support clinical decision-making.

TECHNICAL SKILLS

Programming & Scripting: Python, Java, Scala, SQL, PL/SQL, T-SQL, Bash, Shell Scripting
Big Data & Streaming Technologies: Apache Spark, Hadoop, Hive, Pig, HBase, Flink, Kafka, MapReduce, Spark Streaming
ETL & Data Orchestration Tools: Apache Airflow, dbt Core, Informatica PowerCenter, Talend, Microsoft SSIS, Prefect, Apache NiFi
Cloud Platforms & Services: AWS (EC2, S3, Redshift, Glue, EMR, Athena, RDS), Azure (Data Factory, Databricks, Synapse Analytics, Data Lake, Azure Event Hubs, Azure OpenAI)
Data Warehousing & Databases: Snowflake, Redshift, Synapse, Oracle, SQL Server, PostgreSQL, MySQL, MongoDB, DynamoDB
Machine Learning & MLOps: Scikit-learn, TensorFlow, PyTorch, NumPy, Pandas, SciPy, MLflow, Feature Engineering, Model Training, Tuning
LLMs & Generative AI: Retrieval-Augmented Generation (RAG), LangChain, Prompt Engineering, Natural Language Processing (NLP)
DevOps & Infrastructure as Code: Git, Jenkins, GitLab CI/CD, Azure DevOps, ArgoCD, Docker, Kubernetes, Terraform, CI/CD Pipelines
Data Visualization & Methodologies: Power BI, Tableau, Microsoft Excel, Agile, Scrum, SDLC

EDUCATION

Master’s in Computers & Information Sciences, Florida Atlantic University, Boca Raton, Florida, USA
Bachelor’s in Computer Science, Koneru Lakshmaiah University, Guntur, AP, India


