Sakshi Shah
Email: *************@*****.*** Phone: +1-551-***-****
PROFESSIONAL SUMMARY:
• Accomplished Data Engineer with 8+ years of experience designing scalable data pipelines, building cloud-native analytics architectures, and implementing secure, automated data workflows across healthcare, finance, and telecom domains. Proven ability to translate complex data into business value using AWS, GCP, and Azure platforms.
• Designed and optimized 50+ ETL/ELT pipelines using Apache Airflow, AWS Glue, and Azure Data Factory, reducing pipeline runtimes by up to 40% and improving data availability SLAs.
• Migrated legacy data infrastructure to Amazon Redshift and Google BigQuery, improving query performance by 35–45% and reducing cloud storage costs.
• Built real-time streaming systems using Kafka, Spark, GCP Pub/Sub, and Dataflow, enabling proactive decision-making with sub-second latency.
• Integrated Generative AI models (OpenAI, Hugging Face) into data pipelines for automated document summarization and trend detection, improving data consumption speed by 30%.
• Led CI/CD automation using GitLab CI, Kubernetes, and Terraform, cutting manual deployment time by 70% and enhancing infrastructure reliability.
• Created actionable dashboards with Power BI, QuickSight, and Looker to support finance and healthcare decision-makers with real-time insights and anomaly detection.
• Automated anomaly detection and data quality checks using Great Expectations and Airflow DAGs, reducing incident response time by 60%.
• Orchestrated multi-cloud observability using CloudWatch, Azure Monitor, and Stackdriver, providing end-to-end monitoring across pipelines and infrastructure.
• Collaborated cross-functionally with data scientists, product teams, and compliance stakeholders to deliver data products that improved operational KPIs by 20–30%.
TECHNICAL SKILLS:
Scripting Languages: Python, SQL, Bash, PowerShell, Java, Scala, PySpark, YAML, R
Databases: SQL, MySQL, SAP HANA, CosmosDB, MongoDB, PostgreSQL, Delta Lake
AWS Services: AWS S3, AWS Redshift, AWS RDS, AWS DynamoDB, AWS Glue, AWS Data Pipeline, Kinesis, EMR, Amazon Aurora, Amazon QuickSight
Visualization Tools: Tableau, Power BI, Looker, Microsoft Excel, Data Lineage Tools
ETL Tools: Alteryx, Apache NiFi, AWS Glue, Azure Data Factory (ADF), Informatica, DLT, Apache Spark, Talend, AWS Lambda, DBT, SSIS
Packages: Pandas, NumPy, Matplotlib
AI/ML Tools: Azure ML Studio, TensorFlow, PyTorch, Keras, AWS SageMaker, Azure ML, Scikit-learn, BigQuery ML, MLOps
GenAI: OpenAI API, ChatGPT, Hugging Face Transformers
Big Data: Apache Hadoop, Apache Kafka, Apache Flink, HBase
Cloud Ecosystem & DevOps: AWS, Azure, GCP (BigQuery), Snowflake, Docker, Kubernetes, CI/CD Pipelines, AWS DevOps, AWS Step Functions, Databricks on Azure and AWS
Data Warehousing: Azure Synapse Analytics, Google BigQuery, Snowflake, Star Schema, Snowflake Schema, Data Modeling
Version Control: Git, GitHub, GitLab
Monitoring & Logging: CloudWatch, Datadog, ELK Stack, Prometheus, Grafana
Other: SDLC, Agile Methodologies, Data Cleaning, Automation, Problem Solving, Critical Thinking, Root Cause Analysis, A/B Testing, Bugzilla, Schema Evolution, Data Governance, Medallion Architecture
PROFESSIONAL EXPERIENCE:
NIKE – NEW YORK, USA
Data Engineer | Dec 2022 – Present
• Architected and maintained scalable batch and streaming pipelines using Apache Airflow and Spark on AWS, enhancing ETL throughput and reducing data latency by 40%.
• Migrated legacy on-prem data systems to Amazon Redshift and Google BigQuery, improving query speed by 45% and cutting cloud infrastructure costs by 30%.
• Integrated OpenAI GPT models and Hugging Face Transformers into data pipelines to auto-summarize clinical notes and prior authorizations, reducing manual processing time by 60%.
• Designed ingestion frameworks with AWS Glue, S3, and Step Functions to support semi-structured healthcare data, increasing ingestion efficiency by 50%.
• Implemented GCP-native real-time data ingestion workflows using Pub/Sub and Dataflow, enabling sub-second event stream processing.
• Leveraged BigQuery ML to deploy classification models for claims data, improving fraud detection accuracy by 25%.
• Developed serverless validation workflows with AWS Lambda and Step Functions, improving data accuracy SLAs and reducing engineering overhead.
• Automated CI/CD workflows using GitLab, Docker, and Kubernetes, reducing deployment errors by 70% and enhancing pipeline reliability.
• Implemented HIPAA- and GDPR-compliant access controls and encryption using AWS IAM, KMS, and TLS, ensuring full data governance across cloud platforms.
• Collaborated with data scientists, analysts, and compliance stakeholders across product and engineering teams to design end-to-end data solutions aligned with HIPAA requirements and business KPIs, accelerating feature delivery by 30%.
AMERICAN EXPRESS – NEW JERSEY, USA
Data Engineer | Sep 2020 – Nov 2022
• Built real-time fraud detection pipelines using Apache Kafka and Flink, enabling near-instant anomaly detection and reducing fraud response time by 50%.
• Architected a data lakehouse solution using Azure Data Lake, Delta Lake, and Power BI, improving data accessibility and reducing reporting latency by 40%.
• Automated ELT workflows with Talend, Azure Data Factory, and Python, reducing manual data handling and pipeline failures by 65%.
• Integrated 10+ external APIs including Plaid, Experian, and Salesforce, automating ingestion processes and enriching customer data profiles by 30%.
• Developed segmentation dashboards in Power BI with drill-downs and dynamic filters, increasing marketing campaign efficiency by 25%.
• Implemented data quality monitoring using Great Expectations and Apache Airflow DAGs, reducing data validation errors by 45%.
• Created version-controlled transformation logic using DBT, improving pipeline maintainability and documentation consistency across teams.
• Designed Azure Synapse pipelines to join structured finance data with third-party enrichment feeds, streamlining compliance reporting workflows.
• Scheduled near real-time data workflows using Azure Event Grid and Functions, achieving sub-minute latency for high-priority event triggers.
• Developed Spark-based batch processing jobs in Azure Databricks to support credit risk modeling, accelerating model scoring by 35%.
• Secured sensitive data using Azure Key Vault and RBAC policies, ensuring full compliance with PCI DSS and internal security policies.
• Conducted training for analysts and product managers on interpreting Power BI dashboards, increasing data adoption across business units by 40%.
GRANDISON MANAGEMENT - NEW JERSEY, USA
Data Engineer | July 2019 – Aug 2020
• Designed and deployed Hadoop-based batch data pipelines processing over 50TB/month, enabling scalable analytics across telecom network logs and supporting strategic decision-making.
• Engineered real-time network monitoring solutions using Apache Spark Streaming and Kafka, reducing incident response time by 40% and improving SLA adherence.
• Developed predictive maintenance models using PySpark MLlib to forecast telecom equipment failure, increasing system uptime by 30% and reducing field outages.
• Built and automated CI/CD pipelines for ETL deployments using Jenkins and Git, cutting manual release efforts by 60% and increasing deployment frequency.
• Orchestrated multi-stage data validation workflows using custom Python validators and Hive scripts, improving data accuracy and reducing QA cycle time by 50%.
• Maintained metadata governance with Apache Atlas, improving data lineage tracking, enhancing auditability, and meeting compliance mandates.
• Tuned and optimized Hive queries, reducing average job runtime by 40% and eliminating 60% of SLA breaches tied to reporting delays.
• Implemented data encryption and RBAC policies using Apache Ranger and Apache Knox, securing PII data and meeting internal security and compliance standards.
• Collaborated with DevOps and platform teams to containerize ETL jobs and automate deployment pipelines, accelerating data product delivery by 35%.
• Conducted root cause analysis (RCA) of data quality issues across multiple pipeline stages, leading to resolution of recurring data mismatches in less than 2 weeks.
• Integrated structured and semi-structured data from multiple telecom systems into a unified Hive-based warehouse, increasing data accessibility for analysts by 50%.
• Created data quality scorecards and dashboards in Apache Superset to track and visualize ETL success rates, enabling proactive monitoring.
• Led onboarding sessions for junior engineers on Hadoop ecosystem tools and data quality best practices, improving team productivity and reducing onboarding time.
• Partnered with business analysts and stakeholders to define SLAs and success metrics for data products, aligning engineering output with organizational goals.
EDUCATION
Bachelor of Business Administration
Narsee Monjee Institute of Management Studies (NMIMS) – Mumbai, India | Aug 2016 – May 2019
CERTIFICATIONS
• AWS Certified Data Analytics – Specialty
• Google Professional Data Engineer
• Microsoft Certified: Azure Data Engineer Associate
• Certified Apache Airflow Developer (Astronomer)
• Talend Data Integration Certified Developer
• HIPAA Awareness for Business Associates Certification