NIPA SHAH
Data Scientist • AI / ML Engineer
Jersey City, NJ • +1-551-***-**** • ************@*****.*** • LinkedIn • GitHub: nipa-analytics
PROFESSIONAL SUMMARY
Results-driven Data Scientist and AI/ML Engineer with 7+ years of experience designing end-to-end machine learning pipelines, deploying production-grade models, and translating complex data into strategic business insights. Proven expertise in NLP, LLMs (BERT, GPT, HuggingFace), healthcare analytics (MIMIC-IV EHR data), and cloud-based MLOps (AWS SageMaker, Docker, MLflow). Skilled in building scalable ETL workflows, real-time dashboards, and AI-powered solutions using Python, SQL, and leading ML frameworks. Adept at collaborating cross-functionally with product, engineering, and leadership teams. Seeking a Data Scientist role in the U.S. where I can drive measurable impact through advanced analytics and AI innovation.
TECHNICAL SKILLS
Languages: Python, R, SQL, Jupyter Notebook, Git / GitHub
ML / AI: scikit-learn, XGBoost, LightGBM, TensorFlow, Keras, BERT, GPT, Transformers (HuggingFace)
LLM / GenAI: LangChain, Prompt Engineering, Embedding Search, Fine-Tuning LLMs (OpenAI), VectorDB, Pinecone, FAISS
MLOps: Flask, FastAPI, AWS SageMaker, Docker, MLflow, CI/CD Pipelines, Model Monitoring & Drift Detection
Cloud & Big Data: AWS (S3, EC2, SageMaker), Azure (ADF), Hadoop, Spark, Hive, Pig
Databases: PostgreSQL, MS SQL, Oracle, Snowflake, BigQuery, Redshift, Aurora
ETL / Integration: Airflow, Dataiku, Alteryx, Informatica, Pentaho, Knime
Visualization: Tableau, Power BI, Looker, QlikView, Domo, Excel (Advanced), ggplot
Healthcare: EHR Data Modeling, MIMIC-IV, Patient-Level Feature Engineering, Risk Modeling, Cohort Analysis, Population Health Analytics
Statistics: A/B Testing, Hypothesis Testing, Regression, Clustering, NLP, Bayesian Analysis
CERTIFICATIONS
Databricks – Generative AI Fundamentals • AWS – Cloud Practitioner (CLF-C02) • Google – Looker Studio for Dashboards • Python Essential Training • Data Science Foundations • Generative AI Fundamentals • MySQL Essential Training • Supply Chain Basics
PROFESSIONAL EXPERIENCE
Data Scientist • AdvanceInnovative LLC – New Jersey • June 2024 – Present
• Collected, cleaned, and preprocessed structured and unstructured datasets using Python (Pandas, NumPy) and SQL across healthcare, retail, and business domains, enabling efficient downstream ML modeling.
• Built and deployed classification, regression, and clustering models using scikit-learn, XGBoost, and LightGBM; improved prediction accuracy by up to 25%.
• Developed end-to-end NLP pipelines using BERT and GPT models (HuggingFace Transformers) for sentiment analysis, document summarization, and named entity extraction.
• Deployed ML models via Flask APIs and AWS SageMaker; integrated Docker containers and MLflow for reproducible, scalable production pipelines.
• Designed and maintained Airflow-orchestrated ETL workflows for real-time model inputs and consistent data delivery across platforms.
• Tracked post-deployment model performance using drift detection frameworks; retrained models proactively to maintain long-term business accuracy.
• Performed EDA and statistical testing (A/B testing, hypothesis testing) to uncover patterns; visualized insights in Power BI and Seaborn for executive stakeholders.
• Collaborated cross-functionally with product managers, analysts, and engineers to translate business objectives into data-driven solutions.
Healthcare Analytics Projects (MIMIC-IV EHR)
Advanced clinical ML projects using real-world ICU/EHR data (MIMIC-IV Clinical Database), demonstrating production-grade healthcare data science capabilities sought by health-tech, pharma, and hospital systems.
Patient Segmentation & Retention Analytics • GitHub • 2024–2025
• Engineered patient-level features from MIMIC-IV admissions, diagnoses, and ICU tables using SQL-style joins to capture clinical complexity, care intensity, and longitudinal engagement patterns.
• Applied Python-based preprocessing (feature scaling, encoding) and K-Means clustering to identify distinct patient cohorts supporting care management and population health programs.
• Generated actionable insights for patient retention strategy, cohort-level risk analysis, and hospital resource optimization — mirroring real-world payer and provider analytics workflows.
• Addressed real clinical data challenges: sparse records, high-dimensional categorical features, and irregular longitudinal histories.
Stack: Python · Pandas · Scikit-learn · SQL · Jupyter Notebook · K-Means Clustering
30-Day Hospital Readmission Risk Prediction & Care Prioritization • GitHub • 2024–2025
• Designed a readmission risk prediction model using patient-level features derived from ICU stays, diagnoses, demographics, and admission history via SQL-style feature engineering in Python.
• Built a complementary patient segmentation framework using MiniBatchKMeans to identify high-risk cohorts and prioritize proactive care interventions.
• Applied full ML pipeline: missing value handling, categorical encoding, feature scaling, model training, and performance evaluation (AUC-ROC, precision-recall).
• Delivered insights supporting care coordination, hospital quality improvement (HCAHPS/CMS metrics), and readmission penalty reduction strategies aligned with U.S. healthcare regulations.
Stack: Python · Pandas · NumPy · Scikit-learn · SQL · MiniBatchKMeans · Jupyter Notebook
ICU Deterioration Risk Prediction • GitHub • 2025
• Built a machine learning model to predict early deterioration of ICU patients using MIMIC-IV vitals, lab results, and clinical observations.
• Engineered time-series features capturing physiological trends, early warning score proxies, and clinical event sequences.
• Evaluated model performance using clinical validation metrics (AUROC, sensitivity/specificity) relevant to clinical decision support systems.
Stack: Python · Pandas · Scikit-learn · MIMIC-IV · Jupyter Notebook
Data Analyst • Code-Criteria Labs • May 2018 – December 2023
• Owned end-to-end revenue data curation via ETL pipelines; automated daily distribution of revenue reports organization-wide, saving 500+ hours annually.
• Conducted Point of Sale (POS) analysis on major retailers (Amazon, Walmart, SharkNinja) to identify category trends in actualized and forecasted sales data.
• Built and maintained forecasting models; performed price audits and negative inventory audits to reduce dynamic pricing and inventory errors.
• Migrated data from legacy systems to Snowflake data warehouse for international regions, improving query performance and data governance.
• Developed KPI dashboards in Domo and QlikView for leadership and cross-functional teams using automated datasets; reported monthly revenue numbers across international regions.
• Performed gap analysis, root cause analysis, and data mining for cross-functional stakeholders; gathered business requirements and authored technical documentation.
• Conducted UAT approvals with IT for data governance systems; mentored new analysts on processes and systems.
Jr. Data Analyst – Intern • TATA Motors Ltd. – Ahmedabad, India • June 2017 – August 2017
• Leveraged PMG catalog suite to automate business processes across R&D teams.
• Developed SQL reports for stakeholders and calculated KPIs for performance analysis.
EDUCATION
Master of Science, Business Analytics • GPA: 3.75 / 4.00
Sacred Heart University – Fairfield, CT • May 2025
VOLUNTEER & TEACHING
Graduate Teaching Assistant (Volunteer) • Sacred Heart University – Applied Statistics • Jan 2025 – May 2025
• Mentored 20+ graduate students on statistical assignments, data interpretation, and structured problem-solving techniques.
• Conducted academic guidance sessions on applied statistics, analytical methods, and coursework requirements; managed midterm/final project workflows.
• Served as liaison between faculty and students to ensure smooth communication and academic delivery.