Shanmukha Sai
Data Engineer
Location: Texas, USA | Phone: 201-***-**** | Email: *******************@*****.*** | LinkedIn
SUMMARY
• Data Engineer with 5+ years of experience in designing, developing, and optimizing data pipelines using Apache Kafka, Apache Airflow, and Databricks for scalable and real-time data processing.
• Proficient in database management systems (PostgreSQL, MySQL, SQL Server, MongoDB) with strong expertise in SQL query optimization, data modeling, schema design, and performance tuning.
• Experienced with cloud data platforms including AWS (S3, EMR, Glue, Redshift, Kinesis) and Microsoft Azure for data storage, ETL, data warehousing, and analytics.
• Skilled in big data technologies such as Apache Spark, PySpark, and Hadoop for batch and streaming data processing across finance and healthcare domains.
• Advanced programming skills in Python (Pandas, NumPy, PySpark), Java, and R with a focus on data engineering, automation, and predictive analytics solutions.
• Expertise in ETL development, workflow automation, and data pipeline orchestration using Apache Airflow, Talend, and REST APIs to ensure data quality, governance, and reliability.
• Hands-on experience in machine learning operations (MLOps) and deployment of predictive models using Scikit-Learn, TensorFlow, and PyTorch to generate business intelligence and insights.
• Strong knowledge of data warehousing, distributed computing, real-time data streaming, and data integration.
TECHNICAL SKILLS
Core Data Engineering: Strong SQL, Snowflake (Performance Tuning, Data Modeling, Snowpark), DBT, Apache Airflow/MWAA, Python (Pandas, NumPy, REST APIs)
Programming Languages: Python (Pandas, NumPy, Scikit-learn, XGBoost, Matplotlib, Seaborn, OpenCV), SQL, PySpark
Data Warehousing & Modeling: Snowflake (Snowpark, Streams & Tasks, Zero-Copy Cloning, Performance Tuning), Delta Lake, Star Schema, Snowflake Schema, Data Vault 2.0
ELT & Data Processing: DBT (Data Build Tool), Apache Airflow, MWAA (Managed Apache Airflow), Apache Spark, data quality frameworks (Great Expectations), real-time streaming and batch processing, CI/CD pipeline implementation
Cloud Services: S3, Lambda, IAM, CloudWatch, MWAA, Glue, Athena, Kinesis, CloudFormation, Lake Formation, ECR, Redshift, DynamoDB, Step Functions, SageMaker, Azure
Cloud Data Integration: REST APIs, Apache NiFi, real-time data processing, API automation and scripting
AI/ML & RAG: LangChain, LlamaIndex, open-source and commercial LLMs (Llama 2, Mistral, OpenAI, Google Gemini Pro), vector databases (Chroma DB, Pinecone), RAG workflows
Databases: MySQL, PostgreSQL, MongoDB, Chroma DB, Pinecone, Snowflake, Redshift
DevOps & Tools: Docker, Terraform, Git, GitHub Actions, CI/CD pipelines, Agile/Scrum methodologies
Data Governance: Data quality, compliance, cloud-native troubleshooting, security best practices
WORK EXPERIENCE
Data Engineer | CVS Health, Texas | Jan 2024 - Present
• Designed and implemented Delta tables and dimensional models in the Gold layer, transforming raw EHR, lab, and treatment data into analytics-ready datasets for clinical reporting and research insights.
• Automated data anomaly detection and validation using rule-based and statistical checks, ensuring accuracy of KPIs for patient outcomes, readmission rates, and lab metrics.
• Built external tables and views in the Platinum layer for Power BI dashboards, APIs, and clinical research datasets, enabling real-time analytics for hospitals and healthcare providers.
• Engineered end-to-end pipelines for patients, encounters, diagnoses, treatments, and lab results, maintaining data integrity, deduplication, and schema enforcement across Bronze–Gold layers.
• Leveraged Delta Lake features (ACID transactions, schema evolution, incremental merges) for incremental data ingestion from EHR systems, ensuring regulatory compliance and audit readiness.
• Developed a Data Quality & Anomaly Detection framework, monitoring data freshness, volume discrepancies, and anomalies, reducing manual reconciliation effort.
• Collaborated with BI and clinical teams to curate datasets for population health analytics, predictive modeling, and clinical research, supporting data-driven decision-making.
• Implemented CI/CD for ETL pipelines using GitHub Actions, automating testing, deployment, and monitoring while ensuring HIPAA-compliant data handling through encryption, access controls, and audit logging to protect sensitive information.
• Optimized clinical analytics pipelines for performance, scalability, and low-latency reporting, enabling near real-time access to large-scale patient datasets.
• Integrated LLMs (Llama 2, Mistral) for NLP-based clinical note analysis, summarization, and insight extraction, supporting advanced clinical research and decision-making.
Data Engineer | Tata Consultancy Services, India | May 2018 - Nov 2022
• Designed and developed scalable real-time data pipelines using Apache Kafka streaming to process, aggregate, and analyze millions of sales and customer events per day, reducing data latency to below 5 minutes.
• Developed and deployed sales prediction and customer segmentation models using Python (Pandas, Scikit-learn) and exposed insights via dashboards in Tableau and Amazon QuickSight for business stakeholders.
• Implemented ETL workflows with Apache Airflow to automate ingestion, transformation, and loading of transaction and customer data from MySQL and MongoDB to Amazon Redshift, improving reporting efficiency by 40%.
• Orchestrated robust data transformation pipelines using DBT (Data Build Tool), enabling version-controlled, modular ELT processes and documenting lineage for data models.
• Automated data ingestion, data enrichment, and event publishing workflows with RESTful API integrations and scripting in Python and Bash, reducing manual intervention and improving system flexibility.
• Ensured data quality, consistency, and governance by integrating validation checks, automated schema evolution, and monitoring with custom Airflow DAGs.
PROJECT 2
• Designed and implemented end-to-end ETL pipelines using Apache Spark, Python, and SQL to ingest and process high-volume financial transaction data in real time.
• Built real-time streaming pipelines with Azure Event Hubs, Azure Stream Analytics, and Databricks, enabling instant fraud detection alerts and improving risk response time by 50%.
• Developed interactive Tableau dashboards integrated with curated datasets for ad-hoc financial analytics and KPI monitoring, enabling stakeholders to track transactions, detect anomalies, and make data-driven decisions.
• Engineered and maintained Delta Lake and Azure Data Lake Storage Gen2 for structured and unstructured financial datasets, improving data reliability and compliance, resulting in zero critical data errors in production.
• Implemented finance-specific compliance checks (SOX, PCI DSS, GDPR) within ETL pipelines to ensure regulatory adherence and audit readiness. Collaborated with data architects and business teams to establish data governance.
EDUCATION
University of North Texas, Denton, TX
Master's in Data Science
Gandhi Institute of Technology and Management, Vizag, India
Bachelor of Technology in Electronics and Communication Engineering
PERSONAL PROJECTS
Pneumonia Detection and Explainability | Python, Vision Transformers, k-NN, LLM, PyTorch
• Developed a Vision Transformer (ViT) model for classifying pneumonia from chest X-rays, achieving 90% accuracy in differentiating pneumonia from normal cases.
• Implemented k-nearest neighbors (k-NN) algorithm to retrieve similar medical cases, providing contextual insights to support model predictions.
• Integrated a Large Language Model (LLM) to generate detailed, human-readable diagnostic explanations, improving interpretability for healthcare professionals.
Severity Sensor Fault Detection | Python, AWS, Flask, Git, Docker
• Developed and implemented a binary classification model using Python and machine learning algorithms to identify root causes of sensor faults in the air pressure system (APS) of heavy-duty vehicles, achieving 90% fault detection accuracy.
• Implemented Docker containerization for the machine learning application, enabling scalable deployment and reducing deployment time by 40%.
• Integrated containerized model with existing systems to ensure seamless operation and improved fault diagnosis workflows.