
RAVI TEJA

Email: **************@*****.*** | Phone: 832-***-****

Senior Data Scientist / ML Engineer

Professional Summary

10+ years of experience building AI/ML solutions using Python, R, SQL, TensorFlow, PyTorch, and Scikit-learn, delivering scalable models across healthcare, pharma, and finance.

Strong background in deep learning architectures (CNN, RNN, LSTM, GRU, Transformers) with Keras, TensorFlow, and PyTorch Lightning for predictive analytics and medical diagnosis.

Hands-on experience with LLMs and Generative AI using Hugging Face, GPT, LangChain, and RAG pipelines, applying advanced prompt engineering for legal and healthcare domains.

Proficient in Natural Language Processing (NLP) with spaCy, NLTK, BERT, RoBERTa, and GPT-based transformers for classification, summarization, and entity recognition.

Expertise in time series forecasting using ARIMA, SARIMA, Prophet, and Transformer-based models, enabling demand prediction and resource optimization.

Skilled in MLOps with MLflow, DVC, GitHub Actions, Jenkins, and Azure ML Pipelines, ensuring reproducibility, governance, and automated deployment.

Built microservices and APIs for ML models using FastAPI, Flask, and TensorFlow Serving, containerized with Docker and orchestrated with Kubernetes (AKS/EKS/GKE).

Designed cloud-native AI pipelines with AWS SageMaker, Azure ML, and GCP Vertex AI, supporting end-to-end ML lifecycle management.

Experienced in Big Data platforms including Apache Spark, Databricks, Kafka, and Flink, processing high-volume clinical and financial datasets.

Developed ETL workflows with Apache Airflow, Azure Data Factory, Google Dataflow, and Talend, enabling automated ingestion and transformation at scale.

Created AI-powered dashboards with Tableau, Power BI, and Looker, embedding predictive insights into executive decision-making.

Applied model interpretability techniques such as SHAP, LIME, and partial dependence plots (PDPs) to ensure compliance and transparency in regulated industries.

Integrated OCR solutions with Azure Cognitive Services and Tesseract, extracting structured data for downstream NLP pipelines.

Technical Skills

Programming & Databases:

Python, R, SQL, Scala, Bash, Shell Script, HTML, Markdown, PostgreSQL, MongoDB

Machine Learning & AI:

scikit-learn, XGBoost, LightGBM, CatBoost, PyCaret, TensorFlow, PyTorch, PyTorch Lightning, Keras, H2O.ai, DataRobot, TFX, AutoML

Generative AI & NLP:

Hugging Face, BERT, GPT-4, RoBERTa, DistilBERT, LangChain, LangGraph, RAG, spaCy, NLTK, FastText, Word2Vec, OCR, NER, Summarization

Time Series:

ARIMA, SARIMA, Prophet, LSTM, GRU, Transformer-based forecasting

MLOps & Deployment:

MLflow, DVC, FastAPI, Flask, TensorFlow Serving, Docker, Kubernetes, GitHub Actions, Jenkins, Ray Tune, Optuna, Hyperopt

Big Data & Streaming:

Apache Spark, Databricks, Kafka, Flink, Beam, Hive, Sqoop, HDFS

ETL & Orchestration:

Apache Airflow, Azure Data Factory, Google Dataflow, AWS Glue, Talend, dbt

Databases & Storage:

PostgreSQL, MongoDB, BigQuery, Snowflake, Redshift, S3

Visualization & BI:

Tableau, Power BI, Looker, Matplotlib, Seaborn, Plotly, Dash

Cloud Platforms:

AWS (SageMaker, Lambda, Glue, Redshift), Azure (ML, AKS, ADF, Synapse, Purview), GCP (Vertex AI, BigQuery, Dataflow)

PROFESSIONAL EXPERIENCE

Data Scientist / ML Engineer

AbbVie, Vernon Hills, IL Dec 2023 – Present

Responsibilities

Designed and implemented time series forecasting pipelines using Python, ARIMA, SARIMA, Prophet, and LSTM to optimize pharma supply chain planning and reduce delays by 20%.
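
As an illustration of the forecasting pattern above, a minimal SARIMA sketch with statsmodels; the weekly series is synthetic and the (p, d, q)(P, D, Q, s) order is an assumption, not the production configuration.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
idx = pd.date_range("2022-01-02", periods=156, freq="W")                      # three years of weekly data
y = pd.Series(100 + np.random.default_rng(0).normal(0, 5, 156), index=idx)    # stand-in for demand
fit = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 52)).fit(disp=False)
forecast = fit.get_forecast(steps=12)
print(forecast.predicted_mean)    # point forecasts for the next 12 weeks
print(forecast.conf_int())        # intervals for safety-stock planning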

Built deep learning architectures including CNNs, RNNs, and LSTMs in TensorFlow, Keras, and PyTorch Lightning, improving disease outcome predictions on longitudinal EHR datasets.

Automated end-to-end ML workflows in Azure ML Pipelines, integrating model training, validation, deployment, and monitoring for production workloads.

Containerized ML models with Docker and deployed to Azure Kubernetes Service (AKS), ensuring scalability and 99.9% production uptime.

Developed data pipelines with Azure Data Factory (ADF) to ingest, clean, and transform multi-terabyte datasets from SQL Server, Blob Storage, and third-party APIs.

Built Retrieval-Augmented Generation (RAG) pipelines with LangChain + ChromaDB, fine-tuning domain-specific LLMs for clinical document summarization and reducing manual review time by 40%.
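
A stripped-down sketch of the retrieval step, using the ChromaDB client directly; the document contents are invented and the final LLM call is stubbed out, since any chat-completion API can fill that role.

import chromadb
client = chromadb.Client()                           # in-memory instance for illustration
docs = client.create_collection(name="clinical_docs")
docs.add(
    ids=["d1", "d2"],
    documents=["Trial protocol summary ...", "Adverse event narrative ..."],
)
hits = docs.query(query_texts=["What adverse events were reported?"], n_results=2)
context = "\n".join(hits["documents"][0])            # top-k passages for the prompt
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
# answer = answer_with_llm(prompt)                   # hypothetical helper; LLM call omitted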

Engineered agentic AI workflows using LangGraph and GPT APIs, automating legal contract clause extraction, similarity search, and summarization.

Designed and fine-tuned NLP pipelines using BERT, RoBERTa, and spaCy, enabling classification of medical claims and extracting ICD codes from unstructured clinical text.
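
A hedged sketch of transformer text classification with the Hugging Face pipeline API; the public checkpoint below is a stand-in for the domain-tuned claims model.

from transformers import pipeline
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")  # public stand-in
print(classifier("Patient reports persistent chest pain after discharge."))
# returns [{'label': ..., 'score': ...}]; a claims classifier would swap in a clinical checkpoint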

Implemented real-time streaming ingestion using Apache Kafka, Spark Structured Streaming, and Azure Event Hubs, processing live clinical events for decision-making.
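
A minimal Spark Structured Streaming read from Kafka, shaped like the ingestion above; the broker address, topic, and console sink are placeholders, and the Kafka source also requires the spark-sql-kafka connector on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("clinical-events").getOrCreate()
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")    # placeholder broker
    .option("subscribe", "clinical-events")              # placeholder topic
    .load()
    .select(col("value").cast("string").alias("event_json"))
)
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()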

Developed scalable batch pipelines with PySpark on Databricks, reducing ETL runtimes by 30% across EHR datasets.

Created Tableau and Power BI dashboards to track prediction quality, model drift, and KPIs across therapeutic areas.

Applied explainable AI methods such as SHAP and LIME in Python, ensuring transparency for deep learning models under HIPAA compliance.
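
A compact SHAP example, using a tree model for brevity (the same idea extends to deep networks via shap's Deep or Gradient explainers); the data here is synthetic.

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # toy stand-in
model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)    # per-feature contribution for every row
shap.summary_plot(shap_values, X)         # global view of which features drive predictions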

Improved classification and regression accuracy by 15% through hyperparameter tuning using Optuna and Ray Tune across PyTorch and scikit-learn models.
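
A sketch of the Optuna loop; the gradient-boosting search space and trial count are illustrative, not the tuned production values.

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples=500, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
    }
    model = GradientBoostingClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)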

Integrated CI/CD pipelines with Azure DevOps, Git, and DVC, automating retraining, model versioning, and production deployment.

Environment: Python, TensorFlow, PyTorch, Azure ML, AKS, ADF, Databricks, FastAPI, LangChain, LangGraph, ChromaDB, Kafka, Spark Streaming, Tableau, Power BI, MLflow, DVC, Optuna, Ray Tune, Docker

Data Scientist / ML Engineer

Edward Jones, St. Louis, MO Aug 2021 – Nov 2023

Responsibilities

Migrated legacy analytics workflows from on-prem systems to GCP Vertex AI, BigQuery, Cloud Storage, and Dataflow, reducing pipeline execution time by 60% and lowering operational costs.

Designed and implemented ensemble ML models using XGBoost, LightGBM, and Stacking Classifiers in scikit-learn, increasing credit risk scoring accuracy by 18%.
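
A minimal stacking setup of this kind, assuming xgboost and lightgbm are installed; the synthetic data stands in for the credit-risk features.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
X, y = make_classification(n_samples=1000, random_state=0)
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("lgbm", LGBMClassifier()),
        ("rf", RandomForestClassifier()),
    ],
    final_estimator=LogisticRegression(),   # meta-learner over base-model predictions
    cv=5,
)
stack.fit(X, y)
print(stack.score(X, y))   # training-set accuracy only; evaluate on held-out data in practice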

Developed and deployed conversational AI assistants with Rasa NLU/Core and Dialogflow CX, integrated with backend APIs to automate onboarding and customer support.

Built a real-time fraud detection system using Apache Flink, Kafka Streams, and TensorFlow Serving, enabling sub-second predictions with a false-positive rate under 2%.

Created production-grade ML APIs with FastAPI, Docker, and Kubernetes, deploying on GCP for scalable, low-latency endpoints.
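
A bare-bones FastAPI serving sketch; model.pkl and the flat feature schema are placeholders for the real artifacts.

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")    # hypothetical pickled scikit-learn model

class Features(BaseModel):
    values: list[float]             # illustrative flat feature vector

@app.post("/predict")
def predict(features: Features):
    pred = model.predict([features.values])[0]
    return {"prediction": float(pred)}
# Run with: uvicorn main:app --host 0.0.0.0 --port 8000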

Engineered CI/CD pipelines using GitHub Actions, Terraform, and Docker Compose, automating training and deployment for ML models in Vertex AI.

Designed feature engineering workflows with TensorFlow Transform (TFX), ensuring consistent data preprocessing across training and inference.

Automated ETL pipelines using Apache Airflow, orchestrating data extraction, transformation, and scoring workflows for financial datasets.
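
An illustrative Airflow DAG for the extract, transform, and score flow; the task bodies are stubbed.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...
def transform(): ...
def score(): ...

with DAG(
    dag_id="daily_scoring",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",              # schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    extract_t = PythonOperator(task_id="extract", python_callable=extract)
    transform_t = PythonOperator(task_id="transform", python_callable=transform)
    score_t = PythonOperator(task_id="score", python_callable=score)
    extract_t >> transform_t >> score_t    # linear dependency chain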

Built real-time dashboard solutions with Looker and Tableau, integrating BigQuery datasets for business decision-making.

Implemented explainable AI methods including SHAP, LIME, and model cards to improve model transparency and meet compliance requirements.

Built automated monitoring systems using Prometheus, Grafana, and GCP Stackdriver, tracking model drift, prediction accuracy, and API uptime.

Applied hyperparameter tuning techniques with Optuna and Hyperopt using Bayesian Optimization to maximize model accuracy.

Introduced data versioning with DVC integrated with GitHub, enabling reproducibility and collaborative ML workflows.

Worked closely with DevOps and Cloud Engineering teams to manage infrastructure using Terraform and GCP services, ensuring secure and efficient provisioning.

Led distributed training optimization for large transformer models in PyTorch on GPU-enabled clusters, orchestrated via Kubernetes and IaC provisioning with Terraform.

Environment: Python, scikit-learn, TensorFlow, PyTorch, Vertex AI, BigQuery, Dataflow, Dialogflow, Rasa, FastAPI, Docker, Kubernetes, GitHub Actions, Terraform, Airflow, Looker, Tableau, Prometheus, Grafana, DVC, Optuna, Hyperopt

Data Scientist / ML Engineer

Molina Healthcare, Bothell, WA Jan 2019 – Jul 2021

Responsibilities

Developed deep learning models including LSTM, GRU, and Transformer encoders in TensorFlow and Keras, predicting patient readmission risk with ROC-AUC above 0.85.
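
A toy Keras LSTM shaped like the readmission models above; the sequence length, feature count, and random data are illustrative.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(256, 30, 12)               # (patients, time steps, features per step)
y = np.random.randint(0, 2, size=(256,))      # synthetic readmission labels

model = keras.Sequential([
    layers.Input(shape=(30, 12)),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),    # readmission probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
model.fit(X, y, epochs=3, batch_size=32)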

Leveraged AutoML platforms such as H2O.ai and DataRobot to accelerate development of 100+ classification and regression models, cutting model prototyping time by 50%.

Designed and orchestrated ETL workflows with Apache Airflow and Spark, managing dependencies across HDFS, AWS S3, and on-premise data sources.

Built real-time data streaming pipelines with Apache Kafka and Spark Structured Streaming, enabling real-time monitoring of health event logs and claims data.

Created Power BI dashboards for executives, integrating predictive insights to monitor treatment effectiveness, hospital stay duration, and readmission rates.

Designed and deployed NLP pipelines using spaCy, NLTK, and Hugging Face Transformers, enabling ICD code prediction and clinical text summarization.

Fine-tuned BERT-based models for Named Entity Recognition (NER), extracting key medical codes from unstructured EHR and claims data.
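
A short NER sketch with a public BERT checkpoint standing in for the fine-tuned clinical model.

from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for entity in ner("Patient prescribed metformin at Molina clinic in Bothell."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# a clinical deployment would use a checkpoint tuned on medical entity labels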

Applied time series forecasting with Prophet and ARIMA to project inpatient volumes, reducing resource shortages by improving demand planning accuracy.
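
A minimal Prophet sketch; the frame follows the library's required ds/y column convention, with a synthetic series in place of inpatient volumes.

import pandas as pd
from prophet import Prophet
df = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=104, freq="W"),
    "y": range(104),                              # stand-in for weekly inpatient volumes
})
m = Prophet(yearly_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=12, freq="W")
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())   # demand plan inputs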

Built ensemble ML models combining Gradient Boosting (XGBoost), Random Forest, and Logistic Regression in scikit-learn, optimized with Optuna for hyperparameter tuning.

Engineered feature pipelines in Python (Pandas, NumPy) and SQL, performing categorical encoding, null value imputation, and temporal feature extraction for clinical datasets.

Implemented MLflow for tracking experiments, model versioning, and deployment-ready artifacts, ensuring reproducibility across teams.
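
A minimal MLflow tracking sketch; the experiment name, parameters, and toy model are illustrative.

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlflow.set_experiment("readmission-risk")            # hypothetical experiment name
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("roc_auc", auc)
    mlflow.sklearn.log_model(model, "model")         # versioned, deployment-ready artifact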

Integrated with AWS SageMaker and EC2 GPU clusters, enabling scalable distributed training and deployment of clinical prediction models.

Documented workflows with Jupyter Notebooks, Git, and Markdown, ensuring reproducibility and knowledge transfer for healthcare ML solutions.

Environment: Python, TensorFlow, Keras, PyTorch, Spark, Airflow, AWS SageMaker, EC2, S3, Hugging Face, spaCy, NLTK, BERT, Power BI, MLflow, DVC, Git, Pandas, NumPy, SQL, Optuna, R

Associate Data Scientist

IBing Software Solutions, Hyderabad, India Jun 2015 – Oct 2018

Responsibilities

Designed and implemented classification models in Python (scikit-learn) including SVM, Logistic Regression, and Random Forest, improving marketing campaign response prediction accuracy by 25%.

Built NLP pipelines using TF-IDF, LDA, spaCy, and Gensim, enabling topic modeling and document clustering for customer feedback analysis.
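
A compact Gensim LDA example; the three token lists are a toy stand-in for preprocessed customer feedback.

from gensim import corpora
from gensim.models import LdaModel
texts = [
    ["slow", "delivery", "late", "package"],
    ["great", "support", "helpful", "agent"],
    ["late", "shipment", "refund", "delivery"],
]
dictionary = corpora.Dictionary(texts)                      # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]             # bag-of-words vectors
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)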

Engineered Word2Vec-based semantic vector models in Python with Gensim, improving contextual understanding of customer text data.

Preprocessed raw text using regex, tokenization, lemmatization, and stopword removal with NLTK and spaCy, enhancing quality of downstream NLP models.

Automated ETL processes with SQL (Oracle, MySQL) and Python scripts, reducing manual data preparation time by 70%.

Performed feature engineering with Pandas and NumPy, applying binning, scaling, encoding, and polynomial transformations for improved ML performance.

Conducted correlation analysis and Chi-square tests in R and Python, selecting significant features that boosted classification accuracy.

Applied data imputation techniques (mean, mode, and KNN imputation) using scikit-learn and R’s mice package, improving model robustness with incomplete data.

Built customer segmentation models for churn analysis using K-Means clustering in Python, informing retention strategies through Seaborn and Matplotlib visualizations.
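
A K-Means segmentation sketch of this kind; the four synthetic features stand in for tenure and usage metrics.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                        # e.g. tenure, spend, calls, logins
X_scaled = StandardScaler().fit_transform(X)         # scale before distance-based clustering
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(km.labels_))                       # segment sizes for retention targeting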

Applied dimensionality reduction techniques including PCA in R and Python, optimizing training time for high-dimensional datasets.

Tracked and evaluated model performance with confusion matrices, ROC-AUC, F1-score, and cross-validation in scikit-learn, ensuring reliability of deployed models.

Documented machine learning workflows in Jupyter Notebooks and Markdown, ensuring reproducibility and smooth handover of deliverables.

Environment: Python, scikit-learn, NLTK, spaCy, Gensim, Oracle, MySQL, R, Tableau, Pandas, NumPy, JIRA, Git, Bitbucket


