Venkat Reddy
******.*****@*****.*** | +1-205-***-**** | USA | Open to Relocate
Summary
Data Engineer with 4+ years of experience designing and building mission-critical data pipelines on cloud-based big data platforms using Python, PySpark, Spark, Kafka, Airflow, dbt, AWS, and Databricks. Strong background in batch, streaming, and near real-time data ingestion, integration, curation, data quality, metadata, audit automation, schema design, and reusable framework development for analytics, reporting, and ML workloads. Proven ability to collaborate with business analysts, product managers, data scientists, and other stakeholders to deliver scalable self-service data solutions.
Skills
Programming Languages: Python, SQL, T-SQL, PySpark, Scala, Java, C#, Unix Shell
Big Data & Ecosystem: Apache Spark, Spark Streaming, Kafka, Hadoop, Hive, HDFS, Flink, Batch Processing, Real-Time Processing
Cloud Technologies: AWS, Azure, GCP
Data Warehousing & Lakehouse: Snowflake, Amazon Redshift, Delta Lake, PostgreSQL
ETL & Data Processing Tools: Apache Airflow, dbt, NiFi, Databricks Workflows, Matillion, AWS Glue, Workflow Automation, Data Pipeline Automation
Data Modeling & Architecture: Data Warehouse Architecture, Data Lakehouse Architecture, Database Architecture, Schema Design, Distributed Architecture, Star Schema, Snowflake Schema, SCD Type 1/2, Medallion Architecture
Version Control & DevOps: Git, Jenkins, Docker, Terraform, CI/CD, CloudWatch, Grafana
Data Quality & Monitoring: Great Expectations, PyTest, Data Quality Checks, Metadata Management, Metadata Publishing, Audit Validation, Pipeline Testing, Job Automation
Visualization & Analytics: Tableau, Power BI, Amazon QuickSight
AI / ML Data Engineering: ML Data Pipelines, ML Workload Optimization, Feature Engineering, ML Dataset Preparation, Self-Service Reporting, Data Provisioning
Professional Experience
Client: Citizens Bank, Providence, RI Apr 2024 - Present
Role: Data Engineer
Designed mission-critical batch, event-driven, and near real-time data pipelines using Python, PySpark, Spark SQL, and SQL to process 500GB–1TB/day of data across AWS S3, ADLS Gen2, PostgreSQL, and REST APIs, supporting 10+ downstream systems and reducing data processing delays by 25%.
Improved Databricks and Spark pipelines for reporting and analytics use cases, reducing data latency by 35–40% and improving query performance by 30% for multiple business teams.
Developed Kafka and Spark Streaming pipelines processing 40K–60K events per minute, reducing data availability lag from 15 minutes to under 2 minutes for real-time applications (Sketch 1 following this role).
Created data models in Snowflake and Amazon Redshift using Star Schema, Snowflake Schema, and SCD Type 1/2, improving reporting efficiency by 30% and supporting 15+ analytics use cases (Sketch 2 following this role).
Built reusable frameworks and workflows using Apache Airflow, Databricks Workflows, Azure Data Factory, and Matillion, automating 20+ pipelines and reducing manual effort by 40% (Sketch 3 following this role).
Implemented data quality checks, audit validation, and pipeline testing using Great Expectations, PyTest, and SQL, reducing reporting defects by 30% and improving production reliability to 99.9% (Sketch 4 following this role).
Enhanced data pipelines using Databricks SQL and Scala, and integrated Apache NiFi workflows for batch and real-time ingestion, improving scalability by 35% and reducing failures by 25%.
Monitored production data pipelines using CloudWatch and Grafana, and supported deployments through Git, Jenkins, Docker, and Terraform, reducing issue resolution time by 35% and improving release efficiency by 25%.
Published curated datasets and metadata for 50+ users and maintained documentation, reducing data access turnaround time by 30%.
Worked with business stakeholders to understand reporting needs and deliver scalable data solutions, improving reporting efficiency by 25% and reducing manual effort by 20%.
Built ML-ready data pipelines using PySpark and SQL to prepare feature-engineered datasets for fraud detection and risk analytics, improving model performance by 20% and reducing data preparation time by 30% (Sketch 5 following this role).
Guided 4+ team members on data engineering best practices, improving code quality by 30% and reducing production defects by 20%.
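Sketch 1 - a minimal illustration of the Kafka-to-Spark Structured Streaming pattern referenced above; broker addresses, topic name, schema fields, and S3 paths are hypothetical placeholders, not client code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("txn-stream").getOrCreate()

    # Hypothetical event schema for illustration only.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
        .option("subscribe", "transactions")               # placeholder topic
        .load())

    # Kafka values arrive as bytes; decode and parse the JSON payload.
    events = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*"))

    # Checkpointed append to a Delta table keeps the sink restartable.
    (events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://bucket/chk/transactions")  # placeholder
        .outputMode("append")
        .start("s3://bucket/curated/transactions")                     # placeholder
        .awaitTermination())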
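Sketch 2 - the SCD Type 2 close-and-insert pattern, shown here with the Delta Lake merge API in PySpark as an analogous stand-in for the Snowflake/Redshift SQL MERGE used in production; paths, keys, and the single tracked attribute are hypothetical.

    # Assumes a Spark session with the delta-spark package configured.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    updates = spark.read.parquet("s3://bucket/staging/customers")  # placeholder changed-rows feed
    dim = DeltaTable.forPath(spark, "s3://bucket/dim/customers")   # placeholder dimension path

    # Step 1: expire the current version of any customer whose tracked attribute changed.
    (dim.alias("t")
        .merge(updates.alias("s"),
               "t.customer_id = s.customer_id AND t.is_current = true")
        .whenMatchedUpdate(
            condition="t.address <> s.address",
            set={"is_current": "false", "end_date": "current_date()"})
        .execute())

    # Step 2: append the new versions as current rows (assumes the staging feed
    # contains only new or changed customers).
    (updates
        .withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .write.format("delta").mode("append")
        .save("s3://bucket/dim/customers"))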
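Sketch 3 - a minimal parameterized Airflow DAG of the kind used for reusable pipeline frameworks, assuming Airflow 2.4+; task callables and the table list are placeholders driven from config in practice.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(table, **context):
        print(f"extracting {table}")  # placeholder for the real extract logic

    def load(table, **context):
        print(f"loading {table}")     # placeholder for the real load logic

    with DAG(
        dag_id="reusable_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # One extract -> load pair per source table, generated from a config list.
        for table in ["orders", "customers"]:  # hypothetical table list
            t_extract = PythonOperator(
                task_id=f"extract_{table}",
                python_callable=extract,
                op_kwargs={"table": table},
            )
            t_load = PythonOperator(
                task_id=f"load_{table}",
                python_callable=load,
                op_kwargs={"table": table},
            )
            t_extract >> t_load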
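Sketch 4 - PyTest-style data quality checks of the kind referenced above (the production setup also used Great Expectations suites); the dataset path and column names are hypothetical.

    import pandas as pd
    import pytest

    @pytest.fixture
    def df():
        # Placeholder curated dataset under test.
        return pd.read_parquet("curated/transactions.parquet")

    def test_primary_key_not_null(df):
        assert df["transaction_id"].notna().all()

    def test_primary_key_unique(df):
        assert df["transaction_id"].is_unique

    def test_amount_non_negative(df):
        assert (df["amount"] >= 0).all()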
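Sketch 5 - the flavor of PySpark feature engineering used for fraud/risk datasets; columns, window size, and paths are illustrative assumptions, not the production feature set.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    txns = spark.read.format("delta").load("s3://lake/silver/transactions")  # placeholder

    # Rolling window over each account's 50 most recent transactions.
    w = Window.partitionBy("account_id").orderBy("event_ts").rowsBetween(-49, 0)

    features = (txns
        .withColumn("avg_amount_50", F.avg("amount").over(w))   # recent spend level
        .withColumn("txn_count_50", F.count("amount").over(w))  # recent activity volume
        .withColumn("amount_zscore",
            (F.col("amount") - F.col("avg_amount_50")) /
            (F.stddev("amount").over(w) + F.lit(1e-6))))        # guard against zero stddev

    features.write.format("delta").mode("overwrite").save("s3://lake/gold/fraud_features")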
Client: Novartis, India Apr 2021 - Jul 2023
Role: Data Engineer
Designed and supported highly scalable, distributed data engineering pipelines on AWS EMR and Azure Databricks using Python, PySpark, SQL, and Spark SQL to process 1–2TB/day of clinical and operational data, improving pipeline throughput by 25%.
Migrated Hive and HDFS-based datasets to PySpark workflows and delivered analytics-ready outputs for 10+ Tableau and Amazon QuickSight dashboards, improving reporting performance by 30% and reducing manual data preparation by 20%.
Built ingestion workflows that integrated data from SQL Server, MongoDB, cloud storage, and other source systems into Snowflake and Databricks environments, increasing data availability by 30% for analytics teams.
Applied data warehouse and lakehouse architecture principles to build curated Bronze, Silver, and Gold layers, improving data consistency by 35% and supporting 10+ reporting and ML use cases (Sketch 6 following this role).
Migrated 20+ legacy ETL pipelines to Snowflake and dbt, improving execution performance by 30% and reducing reporting turnaround time by 25%.
Created reusable ingestion and integration frameworks for incremental and full data loads, improving scheduling reliability and reducing pipeline failures by 20%.
Automated Kafka and Spark Streaming pipelines using AWS Lambda to support near real-time processing, reducing latency from 30–45 minutes to under 5 minutes and improving data freshness by 40% (Sketch 7 following this role).
Ensured compliance with healthcare data standards (HL7/FHIR) and implemented PHI-safe data handling, including data masking and access controls, improving data security compliance by 30% and reducing audit issues by 25% (Sketch 8 following this role).
Established data validation, schema checks, metadata standards, and documentation in Agile environments, improving data quality by 30% and reducing data-related issues reported by stakeholders by 25%.
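Sketch 6 - a minimal Bronze-to-Silver curation step in PySpark under the medallion layout referenced above; paths and columns are hypothetical, and the real layers also included Gold aggregates.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    bronze = spark.read.format("delta").load("s3://lake/bronze/lab_results")  # placeholder

    silver = (bronze
        .dropDuplicates(["record_id"])                       # de-duplicate on the business key
        .filter(F.col("result_value").isNotNull())           # drop incomplete records
        .withColumn("ingest_date", F.to_date("ingest_ts")))  # standardized partition column

    (silver.write.format("delta")
        .mode("overwrite")
        .partitionBy("ingest_date")
        .save("s3://lake/silver/lab_results"))  # placeholder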
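Sketch 7 - skeleton of an AWS Lambda handler for a Kafka (MSK) event source mapping, where records arrive grouped by topic-partition with base64-encoded values; the downstream write is a hypothetical placeholder.

    import base64
    import json

    def handler(event, context):
        # MSK event source mappings deliver records under event["records"],
        # keyed by "topic-partition", with each value base64-encoded.
        for topic_partition, records in event.get("records", {}).items():
            for record in records:
                payload = json.loads(base64.b64decode(record["value"]))
                process(payload)

    def process(payload):
        # Placeholder: the real handler enriched and forwarded the event downstream.
        print(payload)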
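Sketch 8 - PHI-safe masking in PySpark: deterministic hashing of identifiers plus redaction of free-text fields; column names and paths are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    patients = spark.read.format("delta").load("s3://lake/silver/patients")  # placeholder

    masked = (patients
        # sha2 keeps joins possible across datasets without exposing the raw MRN.
        .withColumn("patient_key", F.sha2(F.col("mrn").cast("string"), 256))
        .drop("mrn", "ssn")                        # drop direct identifiers outright
        .withColumn("notes", F.lit("[REDACTED]"))) # blank out free-text PHI

    masked.write.format("delta").mode("overwrite").save("s3://lake/gold/patients_masked")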
Education
Master's in Data Analytics - Indiana Wesleyan University, Marion, Indiana, USA.
Certification
AWS Certified Solutions Architect - Associate.