Sai Vineeth Neeli
469-***-**** | ***********@*****.*** | Open to Relocation | LinkedIn
Professional Summary
Over 7 years of experience as a Data Engineer building and managing large-scale ETL pipelines that process complex transactional datasets across consumer and corporate banking systems.
Leveraged PySpark and Spark SQL to implement high-performance data transformations for batch and streaming workloads, improving processing efficiency by up to 50%.
Executed large-scale data ingestion workflows using AWS Glue and Airflow, consolidating structured and semi-structured datasets into Redshift and Snowflake for analytics teams.
Optimized SQL queries and Spark jobs to reduce runtime and resource consumption on multi-terabyte datasets, accelerating reporting and analytics delivery.
Implemented automated data validation and reconciliation frameworks in Python to detect anomalies, duplicates, and missing data, improving overall data quality.
Integrated Kafka and Kinesis streaming pipelines for near real-time ingestion of credit card, loan, and payment transactions into enterprise data platforms.
Managed distributed Hadoop clusters for batch processing and Spark clusters for real-time analytics, ensuring high availability and system reliability.
Built and maintained S3-based data lakes with EMR clusters, enabling cost-effective storage and scalable processing for billions of daily transactions.
Created metadata catalogs and lineage tracking using AWS Glue Catalog and Apache Atlas to improve traceability and transparency across pipelines.
Automated workflow orchestration with Airflow and AWS Step Functions to reduce manual intervention and ensure consistent pipeline execution.
Applied partitioning, indexing, and caching techniques in Spark, Hive, and Redshift to optimize query performance on large-scale datasets.
Monitored ETL pipelines and streaming workflows using CloudWatch, EventBridge, and Airflow alerts, improving operational reliability and response times.
Embedded security best practices with IAM, encryption-at-rest/in-transit, and tokenization to protect sensitive financial and customer data.
Migrated on-premises Oracle, Teradata, and Sybase warehouses to Snowflake and Redshift, increasing scalability and query performance for analytics teams.
Collaborated with analysts and data scientists to integrate diverse financial, transactional, and behavioral datasets into centralized platforms for downstream analytics.
Implemented reusable Python and PySpark modules to standardize transformations, reducing pipeline development time across multiple projects.
Leveraged Spark MLlib and Python libraries to build preprocessing pipelines supporting predictive analytics workloads.
Streamlined batch ETL and near real-time streaming pipelines to support credit risk, portfolio monitoring, and fraud analytics use cases efficiently.
Built dimensional data models for reporting and analytics, improving reporting performance and query efficiency.
Applied cost optimization techniques on AWS and EMR clusters, reducing infrastructure costs while maintaining high performance for large datasets.
Enabled high-throughput ingestion from payment networks such as SWIFT and ACH, along with third-party transactional feeds, integrating them into enterprise data lakes.
Implemented anomaly detection and error-handling frameworks to ensure pipeline resilience, minimizing failures and downtime across production workloads.
Coordinated CI/CD workflows using Git, Jenkins, and automated deployment pipelines to manage ETL updates and versioning efficiently.
Orchestrated Spark streaming jobs for near real-time analytics of credit, payments, and transaction data, providing actionable insights for operations teams.
Maintained audit-ready data pipelines with detailed logging, lineage, and monitoring to support transparency and regulatory control.
Applied advanced SQL optimization techniques in Redshift, Snowflake, and Hive to improve performance of long-running analytical queries for analytics teams.
Implemented data partitioning strategies in Spark and Hive to support scalable processing of high-frequency transaction data.
Conducted root cause analysis on failed jobs, correcting data issues and improving pipeline stability across batch and streaming workloads.
Built interactive dashboards in Tableau and Power BI to visualize transactional patterns, credit performance, and portfolio metrics for business users.
Recognized for delivering highly scalable, maintainable, and efficient ETL and Big Data pipelines that improved operational speed, data quality, and analytical insights.
Technical Skills
Cloud Platforms
AWS (Glue, S3, Lambda, EMR, EC2, RDS, Redshift, SNS, SQS, IAM, Kinesis, CloudFormation), GCP (BigQuery, Dataflow, Cloud Storage)
Programming Languages
Python (Pandas, NumPy, PySpark), Java, Scala, R
Big Data Technologies
Apache Spark, Snowflake, Hadoop (HDFS, MapReduce, YARN, Sqoop, Flume, Oozie, Zookeeper), Hive, Pig, Impala, Kafka
ETL/ELT Development
Informatica, Talend, AWS Glue, End-to-End ETL Processes, Data Lake and Data Warehouse Development
Data Modeling
Relational and Dimensional Models (3NF, Star, Snowflake schemas)
Visualization & Reporting
Power BI, Tableau, Dynamic Dashboards, Cloud RDBMS Reporting
Data Governance & Security
Data Quality, Security Best Practices, Metadata Tracking, Lineage Management
Database Management
SQL (Development, Query Optimization), MongoDB, Cassandra, HBase, Snowflake, Redshift, BigQuery
Development Practices
CI/CD Pipelines, Git, Agile Methodologies
Professional Experience
US Bank, Chicago, Illinois, USA
Role: Senior Data Engineer Apr 2023 – Present
Leading enterprise-scale financial data engineering initiatives supporting credit analytics, fraud detection, and portfolio risk monitoring through real-time and batch data pipelines.
Built complete ETL workflows using PySpark and AWS Glue to process credit card, loan, and payment datasets, enabling fast, accurate analytics for business teams.
Orchestrated complex batch and streaming workflows using Airflow DAGs, reducing manual intervention and increasing pipeline reliability across multiple datasets.
Integrated Kafka and Kinesis pipelines for real-time ingestion of transactional and payment feeds, providing near real-time analytics for fraud detection.
Collaborated directly with Machine Learning Engineers to establish an MLOps pipeline for real-time fraud detection, ingesting and preparing near real-time Kafka/Kinesis data feeds and supporting the deployment and scoring of XGBoost models in the production workflow.
Implemented advanced Spark SQL transformations to cleanse, enrich, and aggregate multi-terabyte datasets, improving downstream reporting and analytics speed.
Migrated legacy Teradata and Oracle warehouses into Snowflake, reducing query times and improving scalability of enterprise analytics pipelines.
Built Python-based anomaly detection frameworks to identify duplicate and inconsistent data, increasing data accuracy for reporting teams.
Developed metadata and lineage tracking solutions with AWS Glue Catalog and Apache Atlas, enhancing operational transparency across production pipelines.
Implemented cost-efficient S3 storage and EMR cluster architectures to optimize large-scale data processing while controlling cloud expenses.
Optimized Redshift and Snowflake queries for high-volume datasets, reducing ETL runtimes by 40% and improving dashboard responsiveness.
Automated monitoring and alerting frameworks using CloudWatch, EventBridge, and Airflow notifications, ensuring timely resolution of pipeline issues.
Developed and standardized reusable PySpark modules and Python scripts, leveraging the Databricks platform for collaborative development, execution, and increased team productivity.
Enabled multi-source API ingestion from VISA, Mastercard, and ACH networks, integrating data into centralized analytics platforms for operational insights.
Constructed interactive Power BI dashboards displaying credit performance, transaction volumes, and fraud trends for operational and management teams.
Applied dynamic partitioning, bucketing, and caching strategies in Spark and Hive to handle high-frequency transactional data efficiently.
Performed root cause analysis on failed jobs, troubleshooting data and pipeline issues to maintain stability and reliability.
Designed high-throughput Spark streaming pipelines to support near real-time fraud analytics and credit scoring workflows.
Collaborated with analysts and data engineers to implement reusable transformations and workflows, increasing team efficiency and reducing duplicated effort.
Implemented end-to-end CI/CD using Git/GitHub and Jenkins, containerizing ETL applications with Docker and orchestrating large-scale deployments via Kubernetes to ensure automated, scalable delivery.
Sentry Insurance, Stevens Point, Wisconsin, USA
Role: Data Engineer Jan 2021 – Mar 2023
Engineered large-scale data pipelines to support insurance analytics, actuarial modeling, and claims optimization across multi-source data systems.
Managed ETL pipelines using PySpark, Talend, and Informatica to consolidate policy, claims, and actuarial datasets into Snowflake for corporate analytics and risk modeling.
Created Spark SQL transformations on Hadoop clusters for high-volume policy and claims data, enabling faster loss-ratio modeling and actuarial reporting.
Implemented Kafka streaming pipelines to ingest telematics and external claims feeds, providing low-latency data for near real-time claims and fraud scoring.
Orchestrated Airflow workflows to automate batch and streaming pipelines, reducing manual monitoring efforts and increasing reliability.
Built OLAP cubes and pre-aggregated tables for actuaries and claims analysts to accelerate query performance on large-scale datasets.
Migrated Sybase and Oracle databases into Snowflake and Redshift, improving query speed and scalability for analytical workloads.
Developed Python-based data validation scripts to detect anomalies and inconsistencies, enhancing data quality for underwriting portfolio analysis.
Implemented metadata management and lineage tracking to maintain transparency across 100+ market data feeds.
Automated claims reserving and loss forecasting pipelines using Spark and Airflow, streamlining reporting for actuarial and regulatory compliance teams.
Optimized SQL stored procedures and Spark configurations to enhance ETL efficiency and reduce execution times on multi-terabyte datasets.
Designed and implemented cost-efficient, serverless ETL solutions using AWS Glue and Lambda functions to process daily catastrophe modeling and reference data feeds.
Managed high-throughput data lake storage on S3, configuring IAM roles for secure access and leveraging EC2 for customized processing environments to support analytical teams.
Implemented monitoring and alerting frameworks to ensure timely detection and resolution of pipeline issues across batch and streaming workflows.
Integrated structured and semi-structured policy, claims, and external risk data into centralized warehouses for faster analytics.
Constructed interactive Tableau and QlikView dashboards to visualize claims frequency, reserve adequacy, and loss ratios for executive and underwriting teams.
Applied best practices for Spark cluster resource management, partitioning, and caching to optimize processing of high-volume policy and exposure data.
Collaborated with cross-functional teams to implement multi-source pipelines and support analytics needs across Underwriting, Actuarial, and Claims departments.
Built automated scripts for reconciliation of incoming premium and external risk data feeds, ensuring data consistency and operational accuracy.
Designed reusable templates for ETL workflows, enabling faster onboarding and development for new datasets and feeds.
CommonSpirit Health, Phoenix, Arizona, USA
Role: Data Engineer Jul 2018 – Dec 2020
Designed and maintained data pipelines for healthcare analytics and compliance reporting, integrating clinical, claims, and operational data for payer-provider insights.
Managed ETL pipelines with PySpark, AWS Glue, and Airflow to process core healthcare datasets including member eligibility, claims (CMS/UB), and provider contracts.
Built Hadoop and Spark-based pipelines for batch and streaming data processing, improving throughput and enabling faster clinical and utilization analytics.
Implemented Python validation frameworks to detect anomalies and enforce HIPAA/PHI compliance, improving data quality for payer-provider analytics.
Integrated Kafka streams for near real-time ingestion of electronic health records (EHR) and prior authorization requests into Redshift and Snowflake.
Built interactive Power BI dashboards to visualize claims processing speed, member utilization, and provider cost efficiency for management reporting.
Optimized PySpark transformations and SQL queries, reducing ETL runtimes and improving performance on high-volume clinical and operational datasets.
Automated Airflow workflows with error handling and alerting to maintain pipeline reliability and reduce downtime.
Developed metadata and lineage tracking for all production pipelines using AWS Glue Catalog, enhancing operational visibility and auditing.
Migrated on-prem Oracle and Teradata warehouses to Snowflake, improving scalability, query performance, and maintainability.
Applied partitioning, bucketing, and caching strategies in Spark and Hive for efficient processing of high-frequency claims and utilization transactions.
Built reusable Python and Spark modules to standardize ETL transformations and accelerate development across projects.
Implemented cost optimization strategies on AWS clusters, balancing processing speed with cloud infrastructure expenses.
Conducted root cause analysis on failed jobs, troubleshooting pipeline and data issues to maintain stability and accuracy.
Integrated multi-source datasets including claims, clinical, and patient engagement data to support advanced analytics.
Managed high-throughput Spark and Hadoop clusters to ensure scalable and reliable processing of core member and claims datasets.
Automated reporting pipelines for daily and monthly utilization and clinical analytics, reducing manual effort and improving accuracy.
Collaborated with analysts to provide reusable data models and transformations, improving efficiency across multiple analytics workflows.