SASHANK SAI
Email: | Mobile: +1-937-***-****
PROFESSIONAL SUMMARY
Experienced Data Engineer with over 7 years of expertise in designing, developing, and optimizing large-scale ETL and data integration pipelines across AWS, Azure, on-prem SQL Server, and GCP environments.
Proven track record of building and maintaining over 200 robust ETL workflows using AWS Glue, Azure Data Factory, and SSIS to support mission-critical analytics and reporting systems.
Skilled in architecting scalable data warehouses and lakes on Amazon Redshift, Azure Synapse, and SQL Server, handling multi-billion record datasets with sub-second query performance.
Extensive experience in the AWS ecosystem, including Glue, Redshift, Kinesis, Lambda, Step Functions, SageMaker, CloudWatch, and CodePipeline, driving automation, real-time processing, and CI/CD efficiencies.
Strong expertise in optimizing query performance and database operations, with demonstrated success improving execution speed by up to 80% through indexing, query tuning, and cluster management.
Adept at implementing data quality frameworks, automated validation, and end-to-end data lineage documentation to ensure compliance with SOC 2, HIPAA, and ISO 27001 standards.
Collaborative partner with analytics, data science, and BI teams to deliver predictive models, interactive dashboards, and actionable insights supporting business growth and risk mitigation.
TECHNICAL SKILLS
PROFESSIONAL EXPERIENCE
Rocket Mortgage Detroit, MI
Data Engineer Jan 2023 - Present
Built and deployed 48+ AWS Glue ETL pipelines using Python to ingest, clean, and transform mortgage data from 12+ source systems, centralizing disparate datasets into Amazon Redshift for enterprise analytics.
Designed scalable Redshift schemas managing over 2 billion records, optimizing sort keys and compression techniques to reduce scan volume by 38%, achieving sub-second response times for complex queries.
Developed real-time streaming pipelines with Amazon Kinesis to process 1.2 million customer interaction events daily, enabling marketing teams to act on behavioral data within 10 seconds.
Automated 65+ Python-based data quality validation scripts triggered by AWS Lambda, resulting in a 98% reduction in data defects and increased confidence in reporting outputs.
Tuned Redshift clusters by resizing nodes and refining workload management, improving query execution times by 52% on tables exceeding 800 million rows and supporting a 35% increase in concurrent users.
Led migration of 10TB of mortgage data from on-prem SQL Server to AWS RDS, orchestrating parallel data loads and validation checks to ensure zero downtime and maintain full data integrity.
Orchestrated multi-step ETL workflows using AWS Step Functions and Lambda, reducing manual orchestration time by 60% and enabling faster recovery from partial failures.
Established automated CI/CD pipelines with AWS CodePipeline, CodeBuild, and Git for Glue ETL jobs, reducing deployment times from 10 days to 3 days and cutting release errors by 80%.
Configured comprehensive CloudWatch monitoring and alerting for over 100 ETL jobs, enabling proactive detection and automated remediation of 95% of potential SLA breaches.
Integrated machine learning models built in AWS SageMaker with Redshift and S3 data, improving mortgage delinquency risk prediction accuracy by 18% and supporting proactive borrower outreach.
Designed, developed, and maintained 120+ stored procedures and user-defined functions within Microsoft SQL Server, enabling accurate mortgage risk calculations and operational reporting for multi-terabyte datasets.
Created and optimized SSIS packages automating batch ETL workflows for on-premises mortgage data sources, improving load efficiency and reliability by 40% and reducing manual intervention.
Developed SSAS cubes and multidimensional data models to enable business users and analysts to perform detailed drill-downs into mortgage portfolio performance and risk metrics.
Built and enhanced SSRS reports delivering executive dashboards and operational insights, improving report generation speed and accuracy by 30%, driving data-driven decision making.
Tuned on-prem SQL Server indexes, statistics, and complex queries, reducing runtime for critical mortgage reports by up to 50% and improving data availability for business users.
UnitedHealth Group Detroit, MI
Data Engineer Feb 2021 - Jan 2023
Developed 40+ ingestion pipelines using Azure Data Factory to consolidate healthcare claims from APIs, flat files, and SQL sources into Azure Synapse for enterprise analytics.
Optimized Synapse SQL queries processing 2.8 billion+ records, reducing runtime from 8 minutes to under 90 seconds and improving report delivery times.
Designed role-based access controls in ADLS via Azure Active Directory to secure protected health information (PHI), ensuring HIPAA compliance.
Authored 90+ stored procedures to automate transformation logic for eligibility and claims, increasing processing throughput by 35%.
Created serverless data transformation workflows with Azure Functions processing over 500K transactions hourly, supporting near real-time reporting.
Implemented real-time ingestion with Azure Event Hubs delivering claim updates within 10 seconds to downstream systems for faster decision-making.
Migrated 6TB of on-prem SQL Server data to Azure SQL Database using parallel loads and integrity validation with zero data loss.
Automated Python-based validation scripts to reconcile 100% of incoming data feeds against business rules, drastically reducing manual errors.
Enforced encryption in transit and at rest with Azure-managed keys to comply with HIPAA and HITRUST security frameworks.
Delivered 30+ ML experiments using Azure ML for readmission risk prediction, boosting model precision by 20% and supporting clinical decisions.
Built monthly batch pipelines in Azure Data Factory to process 250 million+ claims records for actuarial and financial reporting.
Integrated Azure Monitor with alerting on pipeline failures, enabling incident response times under 15 minutes and improving uptime.
Implemented CI/CD automation for data pipelines with Git and Jenkins, reducing manual release efforts by 70% and deployment errors.
Documented data lineage for 200+ datasets from ingestion through reporting, enhancing audit readiness and impact analysis for schema changes.
Partnered with BI teams to deliver optimized Power BI datasets serving over 1,500 active users, improving dashboard refresh speeds by 30%.
Vine Brook Homes Bangalore, IN
Data Engineer Jan 2020 - Jan 2021
Developed 75+ ETL pipelines using GCP Dataflow to integrate property, leasing, and maintenance data from 10+ sources, supporting operational and financial reporting across business units.
Created 110+ BigQuery stored procedures and 40+ user-defined functions to enforce complex real estate business rules during data transformation, enhancing data accuracy and consistency.
Tuned database performance with BigQuery partitioning, clustering, and query optimization techniques, reducing report runtimes by up to 80% on critical operational datasets.
Designed scalable relational schemas supporting 3 million+ property records in BigQuery, enabling comprehensive financial and operational analytics across departments.
Implemented data validation pipelines with Dataflow and Cloud Functions to detect and correct anomalies in 100% of daily ETL loads, significantly improving overall data quality.
Led migration of legacy Microsoft Access and Excel datasets into Cloud SQL and BigQuery, consolidating 15 years of historical property and leasing data for unified access.
Automated daily, weekly, and monthly data refresh jobs with Cloud Composer workflows, improving data availability SLAs by 25% for reporting teams.
Designed batch ETL jobs processing over 500K transaction records per load with zero failures, ensuring timely delivery of data for downstream reporting deadlines.
Implemented Change Data Capture (CDC) using Pub/Sub and Dataflow to enable efficient incremental updates to dependent systems, cutting load times by 60%.
Enforced row- and column-level security using BigQuery IAM policies and Data Catalog tagging to protect sensitive tenant and payment data, meeting compliance standards.
Authored ad-hoc SQL queries for finance and compliance teams in BigQuery, reducing time-to-insight from hours to minutes and supporting critical decision making.
Documented data lineage for 85 datasets using Data Catalog, enhancing transparency and audit readiness for internal and external stakeholders.
Integrated third-party APIs for property valuation and market trend data ingestion via Cloud Functions, enriching internal datasets for executive strategic planning.
Monitored ETL workflows with automated failure detection and recovery using Cloud Composer and Stackdriver, achieving 99.8% pipeline uptime.
Designed optimized data models and dashboards in Looker, enabling executives to track KPIs and property performance metrics in near real-time.
State Farm Bangalore, IN
Data Engineer Mar 2018 - Jan 2020
Built 50+ AWS Glue ETL pipelines processing insurance policy, claims, and customer data from 8+ source systems, enabling unified analytics.
Modeled Redshift schemas for actuarial analysis, efficiently aggregating 1.5 billion+ transaction records to accelerate reporting.
Implemented real-time ingestion pipelines with Amazon Kinesis, delivering claim event data to fraud detection systems within seconds.
Automated 80+ Python-based data validation scripts, increasing data trustworthiness for regulatory and operational reporting.
Tuned Athena queries on S3 datasets to reduce annual execution costs by $12,000 while maintaining sub-minute response times.
Migrated 5TB of structured and semi-structured data from on-premises sources to AWS RDS and S3, maintaining full data integrity.
Executed Hive and PySpark batch jobs on EMR to process 2.2TB of historical claims data, supporting actuarial model training.
Orchestrated 60+ multi-step workflows with AWS Step Functions, improving process reliability and reducing failure rates by 30%.
Configured CloudWatch metrics and alarms for over 100 jobs, enabling early detection and resolution of 90% of failures before SLA impact.
Applied AES-256 encryption and AWS KMS for data protection, achieving ISO 27001 and SOC 2 compliance.
Developed Redshift UDFs to streamline actuarial calculations, reducing processing time by 55%.
Implemented CI/CD pipelines using AWS CodePipeline, cutting Glue job release cycles from 2 weeks to 3 days.
Documented lineage for 120+ datasets across ingestion and analytics layers, supporting compliance audits and traceability.
Optimized stored procedures to process policy updates 2.5x faster, improving operational throughput.
Delivered aggregated datasets feeding downstream ML models, enhancing fraud detection accuracy by 14%.
EDUCATION
Master’s in Computer Science, University of Dayton, Dayton, OH