VENKATESWARA R KAIPU

Sr. Data Engineer

EMAIL: ******************@*****.*** Phone: +1 (480) 999-5775

PROFESSIONAL SUMMARY:

8+ years of experience designing and delivering end-to-end data engineering solutions across cloud, on-premises, and hybrid environments for enterprise-scale analytics.

Skilled at designing scalable ETL and ELT frameworks using Apache Beam, Talend, DBT, and Informatica to enable high-throughput data ingestion and transformation.

Experienced in building distributed data processing systems with Apache Spark, Hadoop, HiveQL, and Presto, optimizing performance with Spark-SQL for large-scale analytics workloads.

Well-versed in data modeling using Kimball methodologies, including star and snowflake schemas, with a strong focus on dimensional modeling and Lakehouse architecture.

Proficient in writing optimized SQL across dialects such as T-SQL and PL/SQL, using advanced techniques including window functions, CTEs, and partitioning strategies (see the sketch at the end of this summary).

Hands-on with integrating diverse datasets using Matillion and Informatica, enabling unified data access across enterprise systems and cloud-native storage solutions.

Experienced in architecting and deploying data pipelines in cloud environments using AWS (S3, Glue, Redshift, Lambda, Kinesis, EMR), Azure (Data Factory, Synapse, ADLS), and GCP (BigQuery, Dataflow, Dataproc).

Skilled in orchestrating workflows using Apache Airflow and AWS Step Functions, with Infrastructure-as-Code deployments using Terraform and CloudFormation.

Adept at implementing CI/CD pipelines using Jenkins, GitHub Actions, and Azure DevOps to support reliable and automated data solution delivery.

Proficient in Python for data wrangling, analytics engineering, and feature engineering using pandas, NumPy, PySpark, and Scikit-learn.

Capable of managing and querying relational and NoSQL databases such as PostgreSQL, Oracle, SQL Server, Snowflake, and MongoDB to support varied analytical workloads.

Skilled in crafting data visualizations and dashboards using Power BI, Tableau, and D3.js to support data-driven decisions through clear business insights.

Experienced in containerizing data applications with Docker and orchestrating deployment using Kubernetes for scalable production environments.

Proven team player skilled at collaborating across data science, DevOps, product, and business teams to drive aligned, high-impact outcomes.

Adaptable and solutions-focused, known for thriving in fast-changing environments by quickly grasping new tools and frameworks to solve real-world data challenges.
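
Illustrative sketch: the summary above mentions window functions, CTEs, and partitioning strategies. The minimal Spark SQL example below shows that pattern in Python; the events table and every column name are hypothetical placeholders invented for illustration, not details of any engagement described in this resume.

from pyspark.sql import SparkSession

# Minimal sketch only. Assumes an "events" table or temp view is already
# registered in the metastore; all table and column names are hypothetical.
spark = SparkSession.builder.appName("cohort_trend_sketch").getOrCreate()

monthly_activity = spark.sql("""
    WITH activity AS (  -- CTE: one row per customer per month
        SELECT customer_id,
               date_trunc('month', event_ts) AS activity_month,
               COUNT(*)                      AS events
        FROM events
        GROUP BY customer_id, date_trunc('month', event_ts)
    )
    SELECT customer_id,
           activity_month,
           events,
           -- Window function: 3-month rolling event count per customer
           SUM(events) OVER (
               PARTITION BY customer_id
               ORDER BY activity_month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3mo_events
    FROM activity
""")

monthly_activity.show(truncate=False)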

TECHNICAL SKILLS:

Programming & Scripting:

Python (pandas, NumPy, matplotlib, Scikit-learn, PySpark), SQL (T-SQL, PL/SQL)

Big Data & Distributed Systems:

Apache Spark, HiveQL, Presto, Spark-SQL, Apache Hadoop, MapReduce, Apache Beam, Apache Kafka, HBase

Cloud Platforms:

AWS: S3, RDS, Redshift, Glue, Athena, Lambda, Kinesis, CloudFormation, CodePipeline, Lake Formation, CloudWatch, EMR; Azure: Data Factory, Synapse, ADLS, SQL DB, Functions, Monitor, RBAC; GCP: Cloud Storage, BigQuery, Dataflow, Dataproc, Functions, Stackdriver

ETL & Data Integration:

Informatica, Talend, Matillion, DBT (Data Build Tool)

CI/CD & Workflow Orchestration:

Apache Airflow, AWS Step Functions, Jenkins, GitHub Actions, Azure DevOps, Bitbucket, Terraform, Git

Databases & Query Engines:

Oracle, PostgreSQL, MySQL, SQL Server, Snowflake, MongoDB

Data Modelling & Architecture:

Kimball Modeling (Star & Snowflake schemas), Dimensional Modeling, Lakehouse Architecture, Data Marts, Data Warehouses

Visualization & Reporting:

Power BI, Tableau, D3.js, Excel (Pivot Tables, Conditional Formatting)

DevOps & Containers:

Docker, Kubernetes

Other Tools & Platforms:

Spark MLlib, Bugzilla, JIRA, Confluence

WORK EXPERIENCE:

GE HealthCare January 2024 – Present

Senior Data Engineer

Responsibilities:

Product: Edison Platform

Designed and maintained scalable ETL pipelines on AWS using AWS Glue, Lambda, and S3 to ingest and process multi-modal healthcare data (DICOM images, EHR, vitals) into the Edison Platform for AI-driven diagnostics.

Collaborated with data scientists and product teams to deploy and monitor ML models for clinical applications (e.g., pneumothorax detection, ICU alerting) using Amazon SageMaker and integrated model outputs into Edison dashboards.

Implemented robust data governance and compliance workflows using AWS Lake Formation, KMS encryption, and IAM roles to ensure HIPAA-compliant data access and auditability across distributed Edison services.

Partnered with data scientists to build feature pipelines using pandas and NumPy, enabling faster ML model development across 5+ predictive use cases.

Implemented SQL solutions with window functions and CTEs to support trend analyses and cohort reporting on datasets exceeding 500M records, reducing query time by ~40%.

Developed optimized Spark-SQL transformations in Databricks, decreasing end-to-end ETL job durations by 30–50% across multi-source data inputs.

Maintained governed datasets using AWS Lake Formation, applying row-level security and data access controls for enterprise-wide compliance.

Managed AWS RDS environments (MySQL), tuning queries, enforcing backup policies, and evolving schemas with zero downtime.

Automated infrastructure provisioning using AWS CloudFormation, deploying reusable templates to spin up Redshift clusters, Glue jobs, and IAM roles in <10 minutes.

Built scalable data pipelines using Apache Beam and Spark, processing real-time and batch streams to support analytics and machine learning workflows.

Designed real-time ingestion systems using AWS Kinesis, achieving sub-second latency for product telemetry and operational alerting pipelines.

Applied Kimball dimensional modeling to design fact and dimension tables, enabling self-service BI and improving query performance for 20+ business users.

Maintained modular and testable transformation logic with DBT, reducing pipeline debugging time by ~35% and increasing development velocity.

Scheduled and optimized batch jobs in Apache Hive, implementing partitioning and bucketing strategies for querying 20+ TB of S3-based data.

Managed schema discovery and job orchestration using AWS Glue, leveraging crawlers and triggers to automate pipeline execution across systems.

Led CI/CD implementation for data workflows using AWS CodePipeline, automating build/test/deploy stages and reducing production rollouts from hours to minutes.

Deployed event-driven pipelines using AWS Lambda, triggering file-based processing and reducing lag in ingestion cycles by >60% (see the sketch below).

Built real-time dashboards with D3.js and JavaScript, visualizing data quality KPIs and pipeline SLAs for engineering and DevOps stakeholders.

Developed REST-based ingestion APIs and integrated with external systems to enable bi-directional data flows supporting microservices and third-party apps.

Transformed semi-structured data at scale using MapReduce in legacy Hadoop clusters, maintaining backward compatibility for historical processing jobs.

Modeled analytics-ready data in AWS Redshift, while optimizing performance with Presto queries for large-scale reporting and dashboarding.

Containerized data apps with Docker and orchestrated workloads on Kubernetes, standardizing deployments across dev, test, and prod environments.

Enabled observability using AWS CloudWatch, setting up custom alerts and metrics to proactively monitor ETL performance and pipeline failures.

Environment: pandas, NumPy, SQL, Spark-SQL, Databricks, AWS Lake Formation, AWS RDS, AWS Redshift, AWS Glue, Apache Beam, Apache Spark, AWS Kinesis, Kimball Modeling, DBT, Apache Hive, S3, D3.js, JavaScript, Hadoop, Presto, Docker, Kubernetes, AWS CloudWatch.
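
Illustrative sketch: one bullet in this role describes event-driven, file-triggered processing with AWS Lambda. The handler below is a minimal, hypothetical version of that pattern; the Glue job name, bucket layout, and job arguments are assumptions made purely for illustration.

import json
import urllib.parse

import boto3

glue = boto3.client("glue")

# Hypothetical downstream Glue job name, used purely for illustration.
GLUE_JOB_NAME = "curate-incoming-files"

def handler(event, context):
    """Triggered by S3 ObjectCreated events; starts a Glue job run per new file
    so that ingestion lag stays low."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Hand the new object's location to the downstream Glue job.
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        print(json.dumps({"job_run_id": response["JobRunId"],
                          "object": f"s3://{bucket}/{key}"}))

    return {"status": "ok"}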

Wells Fargo June 2021 – December 2023

Senior Data Engineer

Responsibilities:

Product: Customer 360 platform

Built and maintained data ingestion pipelines using Amazon S3, AWS Lambda, and Amazon EMR to process and integrate diverse customer datasets into the Customer 360 data lake for enhanced analytics capabilities.

Developed streaming data workflows with Amazon Kinesis Data Streams and AWS Lambda to capture and transform real-time customer interactions, enabling actionable insights and timely marketing responses.

Implemented data cataloging and governance solutions leveraging AWS Lake Formation and AWS Athena to enforce schema management and data access controls and to ensure compliance across the Customer 360 platform.

Designed and managed 40+ scalable ETL pipelines in Informatica, integrating data from legacy systems into AWS-based data warehouses, achieving >99% pipeline uptime.

Authored complex PL/SQL procedures to extract and cleanse 10M+ records/month from Oracle databases, enhancing transformation speed and data reliability.

Queried large-scale semi-structured datasets using Amazon Athena, enabling ad hoc analytics directly on S3 and reducing reporting turnaround by ~50%.

Streamed real-time data from distributed sources via Apache Kafka, supporting low-latency pipelines that powered dashboards for operations and executive monitoring.

Orchestrated multi-step data workflows using AWS Step Functions, automating complex logic with conditional branching and eliminating manual triggers.

Built event-driven ingestion pipelines using AWS Lambda, reducing the delay between data generation and processing by up to 80%.

Developed batch transformation jobs in AWS Glue, enabling schema evolution and metadata cataloging across 15+ dynamic datasets.

Managed object storage lifecycles in Amazon S3, optimizing tiered storage for archiving 20+ TB of data, leading to a 25% cost reduction.

Integrated external APIs and internal microservices via RESTful interfaces, enabling seamless data exchange between third-party and in-house platforms.

Tuned and deployed Spark workloads on Amazon EC2 clusters, reducing memory bottlenecks and increasing job throughput by ~35% for large datasets.

Built scalable transformations in Databricks (Apache Spark) using lazy evaluation and caching, improving job execution time by over 40% (see the sketch below).

Implemented secure, governed lakehouse architectures using AWS Lake Formation, centralizing access controls for 20+ teams across the data ecosystem.

Modeled Snowflake databases using best-practice Snowflake Schema patterns, supporting complex business logic for marketing and sales analytics.

Conducted exploratory data analysis (EDA) and data profiling using Python and Matplotlib, detecting anomalies and ensuring high data fidelity before ingestion.

Wrote advanced SQL and NoSQL queries, bridging relational and document-based systems to create unified views for downstream consumption.

Engineered big data workflows on AWS EMR integrated with Hadoop, scaling up compute for data jobs spanning 5–10 TB/day using spot and on-demand nodes.

Containerized transformation code in Docker and deployed to Kubernetes clusters, standardizing execution across dev, QA, and production environments.

Implemented CI/CD pipelines using Jenkins and Terraform, automating infrastructure provisioning, version control, and deployment for data jobs.

Developed metadata-driven ETL workflows in Matillion for Snowflake, promoting reusability and reducing development time for new data pipelines by ~30%.

Environment: Informatica, PL/SQL, Oracle, Amazon Athena, Kafka, AWS Step Functions, AWS Lambda, AWS Glue, Amazon S3, RESTful APIs, Apache Spark, Databricks, Amazon EC2, AWS Lake Formation, Snowflake, Python, Matplotlib, SQL, NoSQL, AWS EMR, Hadoop, Matillion.
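
Illustrative sketch: one bullet in this role cites lazy evaluation and caching in Databricks. The PySpark snippet below shows that pattern in minimal form; all paths and column names are hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("caching_sketch").getOrCreate()

# Hypothetical input path; the transformations below stay lazy until an action runs.
txns = spark.read.parquet("s3://example-bucket/curated/transactions/")

enriched = (
    txns.filter(F.col("status") == "SETTLED")
        .withColumn("txn_date", F.to_date("txn_ts"))
)

# Cache the reused intermediate result so the two aggregations below
# do not re-read and re-filter the source data.
enriched.cache()

daily_volume = enriched.groupBy("txn_date").agg(F.sum("amount").alias("total_amount"))
daily_counts = enriched.groupBy("txn_date").agg(F.count("*").alias("txn_count"))

# Actions trigger execution; the first one materializes the cache.
daily_volume.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_volume/")
daily_counts.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_counts/")

enriched.unpersist()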

JP Morgan Chase February 2020 – May 2021

Data Engineer

Responsibilities:

Product: Payment Processing Systems

Developed and optimized scalable ETL pipelines using Azure Data Factory and Azure Databricks to ingest, transform, and validate high-volume payment transaction data, ensuring low latency and high reliability for real-time payment processing.

Implemented real-time streaming data solutions with Azure Event Hubs and Azure Stream Analytics to monitor and process payment events, enabling rapid fraud detection and settlement reconciliation.

Collaborated with security and compliance teams to enforce data governance policies using Azure Purview and Azure Key Vault, ensuring PCI-DSS compliance and secure handling of sensitive payment information across payment platforms.

Built and optimized batch and streaming pipelines using Apache Spark and PySpark, processing over 8 TB/week of structured and semi-structured data to support business-critical reporting.

Developed machine learning feature pipelines using Spark MLlib, transforming data for 5+ predictive models used in production for fraud detection and customer segmentation.

Orchestrated 30+ ETL workflows in Azure Data Factory (ADF), integrating data from APIs, flat files, and RDBMS into curated layers within Azure Synapse Analytics, improving pipeline reliability by 40%.

Designed scalable folder structures and access tier strategies in Azure Data Lake Storage (ADLS), reducing data retrieval time by ~35% for analytics teams.

Automated data ingestion, validation, and transformation using Python scripts, cutting down manual processing by 10+ hours/week and improving data quality checks.

Tuned stored procedures and indexing in Azure SQL Database, achieving up to 60% faster query performance for BI and operations dashboards.

Used Azure Functions to automate metadata-driven workflows, reducing manual monitoring and triggering errors by over 50%.

Deployed infrastructure and pipelines via PowerShell scripts, embedding Infrastructure-as-Code (IaC) into the deployment lifecycle for repeatable, auditable rollouts.

Streamed real-time telemetry and log data with Azure Stream Analytics, delivering actionable alerts within seconds of event generation for IoT monitoring use cases.

Scheduled and monitored interdependent DAGs in Apache Airflow, maintaining 99.9% uptime for data workflows across cloud services (see the sketch below).

Modeled and maintained star-schema data marts, improving Power BI report responsiveness by ~45% and enabling self-service analytics for 50+ end users.

Integrated document-based data from MongoDB, transforming JSON documents into structured formats for analytics and reporting.

Built and delivered 10+ interactive dashboards using Power BI, enabling leadership to track KPIs such as sales velocity, churn, and usage patterns.

Monitored data quality and system performance using Azure Monitor and Application Insights, setting up real-time alerts and dashboards to reduce incident resolution time by ~30%.

Enforced RBAC across Azure services, ensuring compliance with internal access policies and audit standards for data handling.

Used Azure DevOps and Bitbucket for version control, automated builds, and CI/CD, improving deployment speed and reducing rollback risks during releases.

Environment: Apache Spark, PySpark, Spark MLlib, ADF, Azure Synapse Analytics, ADLS, Python, Azure SQL Database, Azure Functions, PowerShell, Azure Stream Analytics, Apache Airflow, MongoDB, Power BI, Azure Monitor, Azure RBAC, Azure DevOps, Bitbucket, SQL.
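
Illustrative sketch: one bullet in this role refers to scheduling and monitoring interdependent DAGs in Apache Airflow. The minimal DAG below shows how such dependencies and retries might be declared; the DAG id, schedule, and task bodies are invented placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stub callables standing in for real extract/transform/load logic.
def extract():
    print("pull data from source systems")

def transform():
    print("validate and transform extracted data")

def load():
    print("load curated data into the warehouse")

with DAG(
    dag_id="payments_daily_etl_sketch",          # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",               # daily at 02:00
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The dependency chain is what Airflow schedules and monitors.
    t_extract >> t_transform >> t_load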

Performance Food Group September 2018 – January 2020

Data Engineer

Responsibilities:

Designed and deployed distributed data pipelines using Apache Spark and Hadoop to process 10+ TB/week of structured and unstructured data, improving analytics readiness across departments.

Deployed Spark clusters on Google Cloud Dataproc, tuning executors and memory configurations to reduce job execution time by 35% and cut cloud costs by ~20%.

Wrote and optimized 100+ complex SQL queries and stored procedures in BigQuery, powering dashboards and analytics consumed by marketing and finance teams.

Built real-time streaming pipelines with Apache Spark Streaming, supporting sub-second latency ingestion for business-critical applications processing 10K+ events/second.

Orchestrated end-to-end ETL workflows in Talend to ingest and normalize data from 10+ enterprise systems, including Oracle, into Google Cloud data lakes.

Developed and maintained streaming and batch data pipelines on Google Cloud Dataflow, enriching data and pushing it to BigQuery and downstream analytics systems (see the sketch below).

Coded scalable data transformations in Scala, applying object-oriented and functional programming principles to reduce code redundancy and enhance maintainability.

Scheduled and monitored 40+ DAGs using Apache Airflow, ensuring SLAs were met and automating email alerts for failure recovery and pipeline troubleshooting.

Built and managed collaborative notebooks and pipelines in Databricks, accelerating team productivity through shared development and optimized cluster configurations.

Administered Google Cloud Storage buckets with lifecycle policies and IAM configurations, ensuring 100% compliance with data retention and access controls.

Created serverless event-driven pipelines using Google Cloud Functions to automate tasks triggered by file uploads or pub/sub events, eliminating manual intervention.

Delivered production-grade GCP pipelines integrating Dataproc, Dataflow, and BigQuery with logging and exception handling, achieving 99.9% uptime.

Configured pipeline observability via Stackdriver, building custom dashboards and alerts that reduced debugging time by 45% during incident investigations.

Implemented CI/CD pipelines using GitHub Actions to automate deployment and validation of Spark jobs, reducing code promotion delays by ~50%.

Practiced Agile methodologies using Kanban boards in tools like Jira, tracking sprint progress, refining backlog items, and collaborating with cross-functional teams of 8+.

Contributed to a collaborative DevOps culture, applying version control, unit testing, and code reviews to maintain high release velocity with minimal regressions.

Environment: Apache Spark, Hadoop, Google Cloud Dataproc, BigQuery, Apache Spark Streaming, Talend, Oracle, Google Cloud Dataflow, Scala, Apache Airflow, Databricks, Google Cloud Storage, Google Cloud Functions, GCP, Stackdriver, GitHub Actions, SQL, Kanban.
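
Illustrative sketch: one bullet in this role describes enriching streaming data on Google Cloud Dataflow and pushing it to BigQuery. The Apache Beam (Python SDK) pipeline below is a minimal, hypothetical version of that flow; the Pub/Sub topic, project, and table names are placeholders.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub message into a flat dict matching the BigQuery schema."""
    event = json.loads(message.decode("utf-8"))
    return {
        "order_id": event.get("order_id"),
        "amount": float(event.get("amount", 0)),
        "event_ts": event.get("event_ts"),
    }

def run():
    # Streaming mode; in practice this would run on the DataflowRunner.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/orders")   # hypothetical topic
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="example-project:analytics.orders_stream",  # hypothetical table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()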

Cognizant Solutions/Mattel March 2016 – August 2017

Jr. Data Engineer

Responsibilities:

Built and maintained 25+ ETL workflows in Talend, integrating data from Oracle and flat files, leading to a 30% improvement in data pipeline reliability across business units.

Processed and transformed datasets with pandas and NumPy, cleansing 1M+ records monthly to ensure data consistency for reporting and analytics (see the sketch below).

Developed PySpark batch jobs in Apache Spark to handle 3–5 TB/day of incoming data, reducing ingestion time by 40% and enhancing daily processing efficiency.

Administered Hadoop HDFS clusters storing 50+ TB of data, implementing data replication policies to improve fault tolerance and uptime.

Wrote and optimized 150+ HiveQL queries for filtering, aggregation, and reporting, accelerating dashboard performance for data analysts by ~35%.

Enabled real-time lookups on fast-changing datasets using HBase, reducing data retrieval latency by 50% for time-critical services.

Created and published 10+ Tableau dashboards, tracking KPIs such as customer churn, product usage, and system latency, accessed by 50+ business users weekly.

Automated recurring Excel reports using Pivot Tables and formulas, cutting manual reporting time by 6+ hours/week for operations and sales teams.

Implemented indexing and table partitioning strategies on large datasets, improving query response times by up to 60% during peak load.

Executed complex SQL queries involving multi-table joins, subqueries, and window functions to support ad hoc analysis and decision-making for cross-functional teams.

Participated in QA efforts using Bugzilla, logging and resolving 30+ data-related bugs, improving critical data pipelines' quality and uptime.

Authored Python automation scripts for 20+ data processing tasks, improving task consistency and saving ~10 hours/month in manual effort.

Supported machine learning teams by engineering 20+ input features and building Scikit-learn pipelines for model training and evaluation on historical datasets.

Environment: Talend, Oracle, pandas, NumPy, Apache Spark, PySpark, Hadoop HDFS, HiveQL, HBase, Tableau, Excel, SQL, Bugzilla, Python, Scikit-learn.
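
Illustrative sketch: one bullet in this role mentions monthly record cleansing with pandas and NumPy. The snippet below is a minimal, hypothetical example of that kind of routine; file names and columns are invented for illustration.

import numpy as np
import pandas as pd

# Hypothetical extract; column names are illustrative only.
orders = pd.read_csv("monthly_orders_extract.csv")

# Normalize text keys and strip stray whitespace.
orders["customer_id"] = orders["customer_id"].astype(str).str.strip().str.upper()

# Coerce bad numeric values to NaN, then handle them explicitly.
orders["order_amount"] = pd.to_numeric(orders["order_amount"], errors="coerce")
orders["order_amount"] = orders["order_amount"].replace([np.inf, -np.inf], np.nan)

# Standardize dates and drop rows that cannot be parsed.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
clean = orders.dropna(subset=["customer_id", "order_amount", "order_date"])

# Remove exact duplicates that appear when source files are re-delivered.
clean = clean.drop_duplicates(subset=["customer_id", "order_date", "order_amount"])

clean.to_parquet("monthly_orders_clean.parquet", index=False)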


