Sai Koushik Thati
Phone: +1-314-***-**** | ***************@*****.***
SUMMARY
Data Engineer with 4+ years of experience designing and deploying scalable data pipelines, real-time streaming solutions, and ML-integrated workflows across healthcare, finance, and tech domains. Expertise in building ETL/ELT systems using Python (PySpark), SQL, Apache Spark, and Kafka on cloud platforms including Azure, GCP, and AWS. Proven success in optimizing data architecture, ensuring regulatory compliance (HIPAA, PCI-DSS, GxP), and automating CI/CD using tools like Airflow, Terraform, and Docker. Strong collaboration with cross-functional teams to deliver analytics and ML solutions that drive business outcomes.

SKILLS
Programming & Scripting: Python, SQL, Scala, Java, UNIX Shell Scripting, T-SQL, PL/SQL
Big Data & Distributed Processing: Apache Spark (Scala, PySpark), Hadoop (HDFS, MapReduce), Hive, Kafka, EMR, HBase
ETL Tools: Apache Airflow, AWS Glue, Informatica PowerCenter, Talend, SSIS, DataStage, Sqoop, Oozie, Prefect, Luigi
Cloud Platforms & Services: Microsoft Azure (Data Factory, Synapse Analytics, Data Lake Gen2, Stream Analytics, Databricks), Google Cloud Platform (BigQuery, Vertex AI, Cloud Composer, Pub/Sub, Dataplex, Cloud Functions), Amazon Web Services (EC2, S3, Glue, Redshift, Athena, Lambda, DynamoDB, Kinesis)
Databases & Data Warehousing: Snowflake, Teradata, PostgreSQL, Oracle, MySQL, SQL Server, BigQuery, Azure Synapse, Redshift, DynamoDB
DevOps & CI/CD: Docker, Jenkins, Kubernetes, Azure DevOps, Terraform, Git, Cloud Build (GCP), GitOps
Streaming & Real-Time Processing: Apache Kafka, Azure Stream Analytics, Amazon Kinesis, Apache Spark Streaming
Machine Learning & Libraries: Scikit-learn, TensorFlow, PyTorch, XGBoost, Keras, Pandas, NumPy, Matplotlib, Seaborn
Data Visualization & BI Tools: Power BI, Tableau, Microsoft Excel, QlikView, IBM Cognos, QuickSight, SSRS
Governance & Compliance: HIPAA, PCI-DSS, GxP, Data Lineage, Metadata Management, Data Quality
Project Methodologies: Agile, Scrum, SDLC, Waterfall

EXPERIENCE
Johnson & Johnson, NJ | Data Engineer | Aug 2024 – Present
Engineered large-scale ETL workflows using Python (PySpark), SQL, and Azure Data Factory to process over 5TB of healthcare data, boosting throughput by 40% (see the sketch at the end of this role).
Designed and administered Azure Data Lake Gen2 and Synapse Analytics environments; applied partitioning and indexing to cut query execution time by 30%.
Enabled real-time stream data processing for patient monitoring systems by integrating Apache Kafka and Azure Stream Analytics, maintaining 99.9% system uptime.
Enforced HIPAA/GxP compliance via comprehensive data lineage, automated validation with Great Expectations, and rigorous metadata governance.
Streamlined workflow deployment using Apache Airflow, Docker, and Azure DevOps, reducing manual interventions and deployment cycles by 50%.
Productionized machine learning pipelines, incorporating feature engineering for logistic regression and random forest models, improving clinical outcome prediction accuracy by 22%.
Partnered with Data Scientists and Analysts to deploy ML models and Power BI dashboards, improving clinical trial insights.
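A minimal sketch of the kind of PySpark batch ETL with date partitioning described in this role; the storage paths, column names, and partition key are hypothetical placeholders, not details of the actual pipeline.

```python
# Illustrative only: paths, columns, and the partition key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_etl_sketch").getOrCreate()

# Read raw healthcare records landed in ADLS Gen2 (abfss path is hypothetical).
raw = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/claims/")

# Basic cleansing plus derivation of a partition column.
cleaned = (
    raw.dropDuplicates(["claim_id"])
       .filter(F.col("claim_amount") > 0)
       .withColumn("ingest_date", F.to_date("ingest_ts"))
)

# Write partitioned Parquet so downstream Synapse queries can prune by date.
(cleaned.write
    .mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/claims/"))
```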
Google, CA | Data Engineer | July 2021 – May 2023

Orchestrated ETL/ELT data flows using Cloud Composer (Airflow) to manage 30+ BigQuery workflows daily, adding retry logic and SLA monitoring for reliability (see the sketch at the end of this role).
Built real-time feature stores using Vertex AI and Pub/Sub, enabling ML inference with <100ms latency.
Accelerated model development cycles by 30% by standardizing workflows with PySpark on Dataproc, scikit-learn, and TensorFlow, integrated via Vertex AI Pipelines.
Enhanced Google Cloud Platform (GCP) data security by implementing BigQuery column-level encryption, VPC-SC for network isolation, and granular IAM policies for least-privilege access.
Automated provisioning of GCP resources (Composer, Dataflow) using Terraform, enabling GitOps-driven CI/CD with Cloud Build and zero-downtime deployments.
Led initiative to migrate monolith pipelines into modular, domain-driven data products, increasing data discoverability by 50% through Dataplex and Data Catalog.
Drove cross-functional GCP adoption by leading workshops and creating documentation on Cloud Storage, Pub/Sub, and Dataflow, improving stakeholder alignment by 25%.
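A minimal sketch of the kind of Cloud Composer (Airflow) DAG with retry logic and SLA monitoring described above; the DAG id, schedule, project, dataset, and query are hypothetical, and the BigQuery operator assumes the apache-airflow-providers-google package is installed.

```python
# Illustrative only: DAG id, schedule, table names, and query are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                           # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),              # flag runs that miss the SLA
}

with DAG(
    dag_id="daily_orders_elt_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # ELT step: push the transformation down to BigQuery.
    build_daily_mart = BigQueryInsertJobOperator(
        task_id="build_daily_mart",
        configuration={
            "query": {
                "query": (
                    "SELECT order_date, SUM(amount) AS revenue "
                    "FROM `example_project.raw.orders` "
                    "GROUP BY order_date"
                ),
                "useLegacySql": False,
            }
        },
    )
```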
American Express, India | Data Engineer | January 2020 – July 2021
Developed real-time streaming ETL pipelines with Apache Spark (Scala/Python) and Kafka, slashing fraud detection lag from hours to seconds across 2M+ daily transactions (see the sketch at the end of this role).
Built Customer 360 data marts by integrating transactional, CRM, and clickstream data in Hadoop/Hive, resulting in 20% higher marketing campaign conversion rates.
Created a data quality monitoring system with Python and Great Expectations, decreasing data quality incidents by 60% and ensuring PCI-DSS compliance.
Implemented serverless data transformations using AWS Glue (PySpark) and Lambda, cutting pipeline upkeep efforts by 35%.
Applied column-level encryption (AES-256) and row-level access policies in Snowflake and Teradata, safeguarding 5M+ PII records with <5% performance penalty.
Managed petabyte-scale HDFS environments, leveraging Hive partitioning to optimize performance of financial reporting queries by 40%.
Collaborated with Risk and Finance teams to deploy fraud detection models (XGBoost, Random Forest) in production, improving fraud recall by 15% with <1% false positives.
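A minimal sketch of the kind of Spark Structured Streaming job over Kafka described above; the broker address, topic name, schema, and flagging rule are hypothetical, and running it assumes the spark-sql-kafka connector package is available.

```python
# Illustrative only: broker, topic, schema, paths, and threshold are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud_stream_sketch").getOrCreate()

txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Consume raw transactions from Kafka and parse the JSON payload.
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "card-transactions")
         .load()
         .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("txn"))
         .select("txn.*")
)

# Simple rule-style flag; a real pipeline would score with a trained model instead.
flagged = stream.withColumn("suspect", F.col("amount") > 10000)

# Persist flagged events with checkpointing for fault-tolerant bookkeeping.
query = (
    flagged.writeStream.format("parquet")
           .option("path", "/tmp/fraud_sketch/out")
           .option("checkpointLocation", "/tmp/fraud_sketch/chk")
           .outputMode("append")
           .start()
)
```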
EDUCATION

Master of Science in Data Analytics
Saint Louis University, MO, USA
CERTIFICATIONS
Google Data Analytics Certificate
Google Cloud Data Engineer Certified
Azure Data Engineer Certified