SAI KUMAR DATA ENGINEER
Mail: ************@*****.*** Mobile: +1-815-***-****
PROFESSIONAL SUMMARY
Data Engineer with 3 years of experience helping IT, banking, and healthcare clients turn complex data into meaningful insights. Skilled in Python, SQL, PySpark, Apache Spark, Airflow, AWS, and Redshift, I design and optimize data pipelines that are both efficient and scalable. I enjoy automating workflows, improving system performance, and solving real business problems with data-driven solutions. Always eager to learn the latest technologies, I thrive in collaborative environments, ensuring projects are delivered on time and exceed expectations.
TECHNICAL SKILLS
Programming & Scripting: Python, SQL, PySpark, Java, Bash
Databases & Data Warehousing: PostgreSQL, MySQL, Oracle, SQL Server, Redshift, Snowflake
Big Data & ETL Tools: Apache Spark, Hadoop, Talend, Apache Airflow, AWS Glue
Cloud Platforms: AWS (S3, EC2, EMR, Lambda, CloudWatch), Azure
Data Visualization: Tableau, Power BI
Version Control & CI/CD: Git, GitHub, Jenkins
Other Tools & Frameworks: Kafka, Docker, Linux, REST APIs, Agile/Scrum
PROFESSIONAL EXPERIENCE
Purevisitx Austin, TX
Data Engineer Dec 2024 – Present
Designed and implemented scalable ETL pipelines using Python 3.11, Apache Airflow 2.9, AWS Glue 4.0, and dbt Core 1.7, reducing data processing time by 35% and improving workflow reliability.
Built automated data quality checks and validation scripts with Python 3.11 and SQL (PostgreSQL 15), increasing accuracy of patient and claims data by 25%.
Migrated legacy on-premise SQL Server 2019 databases to Amazon Redshift RA3 nodes, optimizing query performance by 40% to support high-volume analytics.
Developed real-time streaming pipelines with Apache Kafka 3.6, Apache Spark 3.5 (Structured Streaming), and Delta Lake 3.0, enabling near real-time operational insights.
Implemented Data Lakehouse architecture using Apache Iceberg 1.4 on AWS S3, improving scalability, schema evolution, and governance of analytical workloads.
Collaborated with data scientists to integrate ML models into production pipelines using MLflow 2.12, improving patient risk prediction accuracy by 15%.
Optimized Redshift and Spark SQL transformations, boosting dashboard performance for clinical and operational analytics.
Automated infrastructure provisioning with Terraform 1.8, and containerized services using Docker 25.0 and Kubernetes 1.30, reducing deployment overhead.
Leveraged GitHub Actions and Jenkins 2.462 for CI/CD deployment, ensuring consistent and error-free production releases.
Created comprehensive technical documentation and best practices for ETL, dbt models, and Lakehouse workflows, ensuring smooth knowledge transfer.
Monitored and automated pipeline health with AWS CloudWatch, Datadog, and custom Python scripts, maintaining 99.9% uptime.
Regions Bank Birmingham, AL
Data Engineer Nov 2023 – Nov 2024
Designed and maintained financial data pipelines using PySpark and Hadoop, processing over 2TB of transactional data daily with high reliability.
Automated monthly reporting processes using Python, SQL, and Azure Data Factory, reducing manual effort by 50% and ensuring timely stakeholder insights.
Integrated REST APIs to pull real-time market and transactional data into centralized Azure Data Lake, improving data availability for analytics.
Implemented role-based data access and governance policies in Azure SQL and Azure Data Lake, enhancing compliance with GDPR and SOC 2 standards.
Optimized ETL scripts and workflows in Azure Databricks, increasing pipeline efficiency and reducing cloud compute costs by 20%.
Built interactive dashboards in Tableau and Power BI for financial analytics, enabling actionable insights for risk management teams.
Conducted root cause analysis on data discrepancies, improving overall data reliability by 30% and enhancing reporting accuracy.
Collaborated with business analysts and product owners to define KPIs, data requirements, and reporting standards.
Leveraged Spark SQL to transform semi-structured and unstructured datasets into structured formats for downstream analysis.
Ensured high data integrity and system availability by implementing automated backup, recovery strategies, and monitoring using Azure Monitor.
Vivagene Hyderabad, India
Data Engineer Sep 2021 – Nov 2022
Assisted in designing and developing ETL workflows using Talend, Python, and AWS Glue to process healthcare datasets efficiently.
Performed data cleansing, transformation, and validation on patient and claims data to ensure high-quality, accurate datasets.
Managed PostgreSQL and MySQL databases, helping optimize queries and improve storage performance.
Supported migration of legacy Oracle databases to AWS Redshift, gaining experience in cloud-based data warehousing.
Automated simple reports for healthcare teams using SQL and Python, reducing manual reporting effort.
Applied basic indexing and partitioning strategies in Redshift and PostgreSQL to improve query speed.
Developed scripts using Python to monitor pipeline health and log errors, assisting in reducing production issues.
Collaborated with senior engineers and cross-functional teams to understand business requirements and implement scalable data solutions.
Assisted in data profiling and anomaly detection to improve system reliability and ensure clean patient data.
Learned and applied AWS CloudWatch for basic pipeline monitoring and alerting.
Supported CI/CD processes using Git, GitHub, and Jenkins for smooth deployment of ETL workflows.
Gained hands-on experience with Kafka and Apache Spark for processing streaming healthcare data.
EDUCATION
Master's in Advanced Data Analytics – University of North Texas
Bachelor's in Computer Science – Gitam University