
Senior Data Scientist / GenAI Specialist

Location:
Eagan, MN, 55121
Posted:
February 14, 2025


JOSEPH NYAB

651-***-**** **.******.******@*****.***

Data Scientist / GenAI Specialist

Microsoft Certified: Data Analyst Associate | AI Engineer | Azure Data Engineer

PROFILE SUMMARY

Microsoft Certified “Data Analyst Associate & AI Engineer” with 13+ years in IT and 11 years in Data Science, specializing in artificial intelligence (AI), data mining, deep learning, predictive analytics, machine learning, and MLOps, with experience handling both structured and unstructured datasets.

Data Science

Exploratory Data Analysis (EDA): Performed EDA to identify business patterns using visualization tools such as Matplotlib, Seaborn, and Plotly.

Data Visualization & Automation: Automated recurring reports with SSIS, SSRS, SQL, and Python, developing dynamic dashboards on Tableau and Power BI for enhanced data insights.

Programming & Data Engineering: Proficient in Python, Bash, R, SAS, and SQL, including experience with PySpark, Hadoop, Hive, HDFS, and MapReduce for big data analytics.

Statistical & ML Modeling: Strong in predictive modeling, ensemble learning (Bagging, Boosting, Stacking), statistical analysis, factor analysis, ANOVA, hypothesis testing, and econometric techniques.

Big Data & Cloud

Big Data & Cloud Expertise: Extensive experience in big data technologies across AWS, Google Cloud, and Azure, working with large datasets using Hadoop, Data Lakes, Redshift, and NoSQL.

MLOps & CI/CD Pipelines: Built and deployed ML models using REST APIs and WebSockets, integrating with Kubernetes, Docker, Jenkins, and CI/CD workflows across AWS, GCP, and Azure.

LLM Development & Cloud AI: Leveraged AWS Bedrock, SageMaker, OpenAI, and Azure Cognitive Search for scalable AI solutions, including domain-specific fine-tuning of models.

Gen-AI Expertise

Time Series & NLP Specialization: Proficient in delivering AI solutions with Generative AI and NLP frameworks such as LangChain, Hugging Face, NLTK, spaCy, and word2vec; experienced in sentiment analysis, Named Entity Recognition (NER), topic modeling, Optical Character Recognition (OCR), and vector databases (Pinecone, FAISS, Azure Cognitive Search indexes); skilled in time series forecasting with ARIMA, LSTM, RNN, and Prophet.

Conversational AI & RAG: Expertise in developing high-performing RAG-based chatbots and multi-modal AI agents, fine-tuning LLMs such as BERT, GPT-3/3.5/4, and leveraging Azure OpenAI Studio.

AI System Architecture: Designed and deployed end-to-end AI systems integrating ML models, ETL pipelines, APIs, OpenAI API, and cloud infrastructure for enterprise applications.

GPU & High-Performance Computing: Hands-on experience in CUDA/GPU acceleration for AI model training, inference, and real-time image/document processing.

Data Safety & Leadership

Data Security & Compliance: Designed and built data products focused on security, insider risk management, compliance, and lifecycle management with Copilot for Security innovations.

Agile & DevOps Practices: Experienced with bug tracking, version control, and project management tools such as Jira and Git, as well as CI/CD pipelines.

Leadership & Stakeholder Management: Led teams developing ML models, APIs, and data pipelines, collaborating with stakeholders across the healthcare, banking, insurance, research, and energy sectors.

Soft Skills & Adaptability: Strong communication, interpersonal, analytical, and leadership skills, with the ability to quickly master and apply new technologies.

CERTIFICATIONS

Microsoft Certified: Data Analyst Associate

Microsoft Certified: Power Platform Associate

Microsoft Certified: AI Engineer

Microsoft Certified: Azure Data Engineer

TECHNICAL SKILLS

Programming & Scripting Languages:

Python, R, Scala, Java, C, C++, C#, Kotlin, Objective-C, Perl, Ruby, MATLAB, SQL, HiveQL, Bash, Shell Scripting

Data Engineering & ETL:

Apache Spark (PySpark, SparkSQL, Structured Streaming), Apache Flink, Kafka, Airflow, Hive, Sqoop, Logstash, SSIS, Informatica, AWS Glue, Azure Data Factory, Google Data Fusion

Data Science & Machine Learning Frameworks:

TensorFlow, Keras, PyTorch, Scikit-learn, XGBoost, LightGBM, H2O.ai, FastAI, CatBoost

Data Science & Analytics Libraries:

NumPy, Pandas, SciPy, Dask, Vaex, Matplotlib, Seaborn, Plotly, ggplot2, NetworkX, Statsmodels, H2O, Theano, Deeplearning4j, EJML

Data Visualization & BI Tools: Tableau, Power BI, Looker, Apache Superset, D3.js

Big Data & Cloud Platforms:

AWS (S3, Redshift, Glue, Kinesis, EMR, Lambda, Step Functions), Azure (Data Lake, Synapse, Cosmos DB, Data Factory), GCP (BigQuery, Dataflow, Dataproc, Composer), Snowflake, Databricks, HDFS

Databases & Storage:

NoSQL (MongoDB, Cassandra, DynamoDB, HBase, CouchDB), Relational (PostgreSQL, MySQL, SQL Server, Oracle, Teradata), Data Warehousing (Snowflake, Redshift, BigQuery)

AI & Natural Language Processing (NLP):

spaCy, NLTK, Transformers (Hugging Face, BERT, GPT-4, PaLM), Word2Vec, FastText, TF-IDF, LDA, NER, Sentiment Analysis, Text Summarization

Deep Learning & Computer Vision:

CNN, RNN, LSTM, GANs, Faster R-CNN, YOLO, OpenCV, Detectron2, Transfer Learning, Diffusion Models

Statistical & Analytical Techniques:

Bayesian Analysis, Regression (Linear, Logistic, Multivariate), Gradient Descent, Stochastic Optimization, Clustering (K-Means, DBSCAN, Hierarchical), Forecasting (ARIMA, Prophet), PCA, Hypothesis Testing, Markov Chains

Development & DevOps:

Git, GitHub, GitLab, Bitbucket, Jenkins, Docker, Kubernetes, Terraform, Ansible, CI/CD (GitHub Actions, Azure DevOps, CircleCI, ArgoCD)

MLOps & Model Deployment:

MLflow, Kubeflow, TensorFlow Serving, FastAPI, Flask, AWS SageMaker, Google Vertex AI, Azure ML Studio

Security & Compliance:

Data Governance, Encryption, Insider Risk Management, SOC 2, GDPR, HIPAA, IAM, Role-Based Access Control (RBAC)

Development Tools & IDEs:

Jupyter, PyCharm, IntelliJ, VS Code, Eclipse, Xcode, Android Studio, Spyder, Atom

Automation & Workflow Orchestration:

Apache Airflow, Prefect, AWS Step Functions, Google Cloud Composer, Kubernetes Jobs

WORK EXPERIENCE

Lead Data Scientist / GenAI Specialist Mar 2023 – Present

Devoted Health, Eagan, MN

At Devoted Health, I led the development of GenAI-powered solutions, integrating LLMs, NLP, and deep learning to enhance customer engagement, automate workflows, and optimize business processes. I worked on RAG-based AI assistants, multi-modal AI models, predictive analytics, and recommendation systems for Medicare clients. My role involved fine-tuning LLMs, deploying AI models on AWS & Azure, leveraging Kubernetes for scalability, and implementing MLOps pipelines to ensure seamless model deployment and monitoring.

Key Contributions:

Developed and deployed multi-modal AI assistants, incorporating LangGraph and NLP-driven retrieval models to improve user interactions.

Implemented state-of-the-art sentiment analysis models for call center transcripts, providing actionable insights to enhance member experience.

Optimized RAG-based chatbots with prompt engineering, LoRA fine-tuning, and retrieval optimization, improving response accuracy and reducing hallucinations (see the retrieval sketch after this role's tech stack).

Built an image-to-text captioning system using Vision Transformers (ViT) to automate document processing workflows.

Developed predictive models (Random Forest, XGBoost, PTC) to identify high-risk members, enabling proactive interventions and personalized care plans.

Automated claims processing using K-means clustering and KNN algorithms, reducing processing time by 30% and minimizing manual errors.

Enhanced OCR-based document processing by improving Tesseract OCR accuracy for extracting insights from scanned documents.

Optimized AI model training using NVIDIA GPUs, reducing inference time by 85% for deep learning-based image classification.

Built a recommendation engine for Medicare plan selection, leveraging similarity-based KNN models and engagement scores for personalized recommendations.

Leveraged PySpark and SQL to process large-scale healthcare data, performing complex multi-table joins and compliance assessments.

Implemented Generative Adversarial Networks (GANs) to generate synthetic training data, improving model robustness and generalization.

Integrated generative AI models into MLOps workflows, ensuring efficient model deployment, monitoring, and continuous improvement.

Tech Stack: LLMs, RAG, LangGraph, OpenAI API, Hugging Face, TensorFlow, PyTorch, Keras, GANs, Vision Transformers (ViT), PySpark, SQL, NLTK, spaCy, Tesseract OCR, Transformer Models, Named Entity Recognition (NER), AWS (S3, Lambda, SageMaker, Redshift), Azure (ML Studio, Cognitive Services), Kubernetes, Docker, CI/CD, MLOps, Hadoop, Hive, Impala, Snowflake, PostgreSQL, MongoDB, Tableau, Power BI, Matplotlib, Seaborn
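
A minimal sketch of the retrieval step behind a RAG chatbot like the one above: score placeholder passage embeddings against a query by cosine similarity and assemble a grounded prompt. The passages, 384-dimensional random vectors, and retrieve() helper are hypothetical stand-ins for a real embedding model and vector store.

    import numpy as np

    rng = np.random.default_rng(0)
    passages = ["Plan A covers dental.", "Plan B covers vision.", "Claims settle in 10 days."]
    doc_vecs = rng.normal(size=(len(passages), 384))            # placeholder embeddings
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    def retrieve(query_vec, k=2):
        """Return the k passages most similar to the query embedding."""
        query_vec = query_vec / np.linalg.norm(query_vec)
        scores = doc_vecs @ query_vec                           # cosine similarity
        return [passages[i] for i in np.argsort(scores)[::-1][:k]]

    context = "\n".join(retrieve(rng.normal(size=384)))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    print(prompt)

In production the similarity search would typically live in a vector database (Pinecone, FAISS, or an Azure index) rather than in-memory NumPy.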

Sr. Data Science Consultant Jan 2022 – Mar 2023

ConocoPhillips, Houston, TX

At ConocoPhillips, I designed and deployed ML-driven predictive models, AI assistants, and automated data pipelines to optimize drilling efficiency, asset management, and environmental impact analysis. I developed time series forecasting models, anomaly detection systems, and AI-powered decision-making workflows using AWS, GCP, and Snowflake. My role also included seismic data processing, geospatial analysis, and real-time analytics, ensuring compliance with GDPR, environmental, and internal data regulations.

Key Contributions:

Optimized drilling operations with an AI-powered workflow, increasing vertical rate of penetration by 20%, leading to significant cost savings.

Developed RAG-based AI assistants for quick interpretation of asset manuals and operational insights retrieval, improving field operations.

Built ML-based energy demand forecasting models using ARIMA and Prophet, enhancing production planning and market alignment (see the forecasting sketch after this role's tech stack).

Designed computer vision models leveraging satellite imagery and field data to monitor environmental impact and ensure compliance.

Implemented real-time anomaly detection models (Streaming KMeans, ARIMA, LSTM) to enhance drilling safety and efficiency.

Utilized SparkSeis ML platform for seismic data processing, improving reservoir imaging and geological insights.

Optimized Snowflake-based data pipelines for seamless ML model integration and big data analysis.

Developed ML models to analyze gas composition data, reducing operational losses by 25% and increasing production efficiency.

Leveraged GCP BigQuery and Vertex AI to analyze vendor data, providing data-driven insights for case teams.

Automated geophysical and seismic data processing, reducing manual efforts by 40% and improving reservoir simulation accuracy.

Converted REST APIs to GraphQL for efficient ingestion of geospatial maritime vendor data.

Employed regex-based dynamic pattern matching for S3 directories, automating data ingestion pipelines.

Designed unsupervised ML techniques to analyze nuclear magnetic resonance (NMR) log data, enhancing resource extraction strategies.

Containerized ML applications using Docker, ensuring smooth deployment and integration with existing workflows.

Actively participated in Agile workflows using JIRA, collaborating with vendors to evaluate data integration proposals.

Tech Stack: ARIMA, Prophet, LSTM, Streaming KMeans, Random Forest, RAG, OpenAI API, Seismic ML (SparkSeis), PySpark, Snowflake, Hadoop, Hive, Impala, AWS (S3, SageMaker, Lambda, Redshift), GCP (BigQuery, Vertex AI), Azure, Kubernetes, Docker, Airflow, CI/CD, REST APIs, GraphQL, Terraform, SQL, NoSQL, Geospatial Data Processing
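
A minimal sketch in the spirit of the ARIMA demand models above, assuming statsmodels; the synthetic monthly series and the (1, 1, 1) order are illustrative choices, not the production configuration.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic monthly demand with a mild trend, standing in for real data.
    idx = pd.date_range("2020-01-01", periods=48, freq="MS")
    rng = np.random.default_rng(1)
    demand = pd.Series(100 + 0.5 * np.arange(48) + rng.normal(0, 2, 48), index=idx)

    model = ARIMA(demand, order=(1, 1, 1)).fit()   # (p, d, q) chosen for illustration
    print(model.forecast(steps=6))                 # six months ahead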

Sr. Data Scientist Oct 2021 – Jan 2022

Credit Suisse, New York City, NY

At Credit Suisse, I led the development of a computer vision-based intelligent document processing system that automated financial report extraction, accelerating processing and raising extraction accuracy to 99%. I integrated OCR, NLP, and deep learning models (CNNs, LSTMs, BERT) to classify and extract data from regulatory documents, leveraging NVIDIA GPUs, CUDA, and TensorFlow for high-performance parallel processing. Additionally, I collaborated with cross-functional teams across the US, Europe, and India, optimizing SQL-based data pipelines and Hive-based trade error processing, significantly enhancing compliance and operational efficiency.

Key Contributions:

Developed a computer vision model that automated classification of financial and regulatory documents, reducing document retrieval time by 40%.

Optimized OCR accuracy using Tesseract, OpenCV, and Python-based scaling/zooming, achieving 99% precision in document text extraction (see the preprocessing sketch after this role's tech stack).

Employed NVIDIA GPUs and CUDA for parallel processing, enabling real-time inference and rapid area-of-interest (AOI) detection/segmentation.

Built an NLP pipeline using BERT & LSTMs for document categorization and topic modeling, enhancing automated document tagging and retrieval.

Implemented deep learning frameworks (PyTorch, Keras, TensorFlow) with batch normalization, improving document processing efficiency.

Developed JSON schemas for structured extraction of key data fields from tax forms (e.g., W-9, W-8BEN-E, ECI) using OpenCV-based homographic alignment.

Applied Transfer Learning with YOLO models to detect specific fields in tax documents, refining accuracy and feature detection.

Orchestrated multi-threaded Python workflows, reducing document extraction time across multi-page tax forms.

Built SQL-based trade error pipelines using Hive and Impala, automating data ingestion for regulatory visualization in Tableau.

Collaborated with European clients & India IT teams, managing cross-time-zone data integrations and query optimizations.

Automated regular expression-based text processing, filtering OCR noise and improving data reliability for financial compliance reports.

Tech Stack: TensorFlow, PyTorch, Keras, OpenCV, CUDA, YOLO, LSTMs, BERT, Tesseract, Python, NumPy, SciPy, Pandas, Regex, Hive, Impala, MongoDB, SQL, JSON, Tableau, NVIDIA GPUs, Parallel Processing, Multi-threading
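
A short sketch of the scaling-plus-thresholding OCR preprocessing described above, assuming OpenCV and pytesseract; "form.png" and the --psm 6 page-segmentation mode are hypothetical choices.

    import cv2
    import pytesseract

    img = cv2.imread("form.png")                   # hypothetical scanned form
    img = cv2.resize(img, None, fx=2.0, fy=2.0,
                     interpolation=cv2.INTER_CUBIC)             # upscale small print
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # strip background noise
    print(pytesseract.image_to_string(binary, config="--psm 6"))    # assume a uniform text block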

Sr. Data Scientist Dec 2019 – Oct 2021

Cleveland-Cliffs, Cleveland, OH

At Cleveland-Cliffs, I developed predictive models for maintenance, supply chain, and inventory management, along with a real-time anomaly detection system that improved operational efficiency. I automated data analytics dashboards in Power BI, reducing analysis time by 45%, and built robust data pipelines leveraging Azure Blob Storage, Snowflake, Databricks, and Kafka. Additionally, I remotely tested predictive models at the Peru Mine, collaborating with cross-functional teams to optimize steel production, machine maintenance, and mining operations, leading to significant cost savings and efficiency gains.

Key Contributions:

Designed and deployed real-time data streaming solutions using Kafka, Snowpipe, and Azure Blob Storage, enabling seamless data ingestion.

Developed predictive maintenance models using RNNs and Random Forest, reducing steel production downtime by 20%.

Built time-series steel demand forecasting models on AzureML (ARIMA, XGBoost), saving the company $2 million annually.

Implemented IoT-driven predictive models using GPS, temperature, pressure, and speed data, optimizing manufacturing efficiency.

Created REST API & WebSocket applications for remote mining operations control, improving real-time decision-making.

Developed interactive Power BI dashboards, visualizing data from CosmosDB JSON datasets for real-time operational insights.

Integrated AI/ML models for energy consumption optimization, supporting environmental sustainability initiatives in steel production.

Tested and validated predictive models remotely at Peru Mine, collaborating with global teams for real-world deployment.

Built machine failure prediction models using KNN and Logistic Regression, increasing equipment reliability.

Utilized Monte Carlo simulations to assess the accuracy of mine KPI estimates, refining operational forecasting (see the simulation sketch after this role's tech stack).

Architected AI-driven solutions for shovel material block depletion prediction, improving resource allocation.

Performed exploratory data analysis (EDA) with PySpark, NumPy, Pandas, extracting actionable insights from diverse datasets.

Participated in Agile-driven stand-ups and Azure DevOps sprints, ensuring continuous improvement in AI deployments.

Researched and documented DevOps best practices, enhancing defect detection, product quality, and customer satisfaction.

Tech Stack: AzureML, RNN, Random Forest, ARIMA, XGBoost, KNN, Logistic Regression, Snowflake, CosmosDB, Azure Blob Storage, Teradata, Databricks, Kafka, Azure Streaming Data, Power BI, REST API, WebSockets, JSON, Postman, Python, PySpark, Pandas, NumPy, SQL, Git, Agile, Azure DevOps
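
A minimal Monte Carlo sketch of the KPI-accuracy idea above: simulate monthly tonnage under assumed production noise and a daily downtime probability, then report a 90% interval. Every distribution parameter here is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    n_sims, n_days = 10_000, 30
    tons = rng.normal(5_000, 400, size=(n_sims, n_days))     # assumed daily tonnage
    lost = rng.binomial(1, 0.05, size=(n_sims, n_days))      # assumed 5% chance a day is lost
    monthly = (tons * (1 - lost)).sum(axis=1)

    lo, hi = np.percentile(monthly, [5, 95])
    print(f"90% interval for monthly tonnage: {lo:,.0f} to {hi:,.0f}")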

Sr. Data Scientist Jul 2018 – Nov 2019

KeyCorp, Cleveland, OH

At KeyCorp, I led the data science team to develop a portfolio risk dashboard covering all aspects of the credit life cycle for retail unsecured loans. I leveraged time series forecasting (Prophet, ARIMA, SARIMAX) to predict default rates, optimized Big Data workflows with Hadoop and Spark, and tackled unbalanced datasets using SMOTE and KNN imputation. Additionally, I contributed to security projects, implementing real-time object tracking and classification with OpenCV and Google Cloud Video Intelligence.

Key Contributions:

Developed a credit risk dashboard covering default rate predictions, loan performance, and credit life cycle analytics.

Designed time-series forecasting models using Prophet, ARIMA, and SARIMAX, enhancing predictive accuracy for loan defaults.

Managed and analyzed structured & semi-structured data from multiple sources using Hadoop, HDFS, Spark, and MapReduce.

Addressed unbalanced data challenges with SMOTE and KNN imputation, improving model accuracy and fairness (see the resampling sketch after this role's tech stack).

Applied ML techniques (classification, regression, clustering, dimensionality reduction) using MLLib, Spark, and Python.

Developed supervised, unsupervised, and semi-supervised models for document analysis and anomaly detection.

Led security analytics projects by implementing real-time object tracking and classification with OpenCV.

Enhanced video searchability & discoverability using Google Cloud Video Intelligence API.

Collaborated with application engineering & data science teams to ensure smooth model deployment & data integration.

Optimized Spark/Scala and Python projects, leveraging regular expressions (regex) in a Hadoop/Hive environment.

Analyzed model performance and robustness, ensuring algorithmic accuracy and validity in financial risk prediction.

Communicated complex technical solutions effectively within cross-functional teams, driving informed decision-making.

Tech Stack: Prophet, ARIMA, SARIMAX, MLLib, SMOTE, KNN Imputation, Hadoop, HDFS, Spark, MapReduce, Hive, Google Cloud Video Intelligence, OpenCV, Real-time Object Tracking, Video Analytics, Python, Spark/Scala, Regex, Git, Agile
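
A compact sketch of the imbalance handling described above, assuming scikit-learn and imbalanced-learn: KNN-impute missing features, then oversample the minority class with SMOTE before fitting. The synthetic data and ~10% default rate are placeholders.

    import numpy as np
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X[rng.random(X.shape) < 0.05] = np.nan          # 5% of feature values missing
    y = (rng.random(500) < 0.1).astype(int)         # ~10% defaults: unbalanced

    X = KNNImputer(n_neighbors=5).fit_transform(X)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    LogisticRegression().fit(X_res, y_res)
    print("class counts after SMOTE:", np.bincount(y_res))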

Sr. Data Scientist Feb 2015 – Jul 2018

Aetna Insurance, Hartford, CT

At Aetna, I developed personalized product recommendation systems using machine learning to enhance customer satisfaction and client acquisition. I designed and deployed automated customer segmentation models, enabling risk-based insurance personalization while ensuring high-quality care. My role involved ML model development (Logistic Regression, Random Forest, Neural Networks), EDA (R, Python, Tableau), and data pipeline optimization (SQL, ETL, Spark), collaborating closely with data engineers to ensure data integrity and efficiency.

Key Contributions:

Designed and deployed a recommendation system using collaborative filtering, boosting customer engagement and retention (see the similarity sketch after this role's tech stack).

Built automated customer segmentation models, enabling Aetna to offer tailored insurance plans for high-risk clients.

Developed and optimized ML models including Logistic Regression, Random Forest, KNN, SVM, Neural Networks, Linear & Lasso Regression, and K-Means Clustering.

Engineered feature selection and data transformations to enhance ML model performance and accuracy.

Conducted EDA and built interactive visualizations using R, Python, and Tableau, improving data-driven decision-making.

Optimized ETL workflows and data extraction by writing efficient SQL queries for Oracle databases.

Leveraged Spark for scalable ML pipelines, improving data processing efficiency for large insurance datasets.

Performed data validation, cleaning, and preprocessing to ensure accuracy in predictive analytics.

Applied advanced statistical ML techniques for forecasting, classification, and customer behavior modeling.

Collaborated cross-functionally with data engineers, actuaries, and business stakeholders to implement AI-driven solutions.

Tech Stack: Logistic Regression, Random Forest, KNN, SVM, Neural Networks, Lasso Regression, K-Means, R, Python, Spark, SQL, Tableau, Oracle, ETL Pipelines
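
An item-based collaborative-filtering sketch of the kind of recommender described above: cosine similarity between item columns of a toy user-item matrix, then scoring of unseen items for one user. The 5x4 ratings matrix is invented.

    import numpy as np

    # Rows are users, columns are products; 0 means not yet purchased/rated.
    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [1, 0, 0, 4],
                  [0, 1, 5, 4]], dtype=float)

    norms = np.linalg.norm(R, axis=0, keepdims=True)
    sim = (R.T @ R) / (norms.T @ norms + 1e-9)      # item-item cosine similarity

    user = R[0]
    scores = sim @ user                             # weight by what the user already rated
    scores[user > 0] = -np.inf                      # never re-recommend owned items
    print("recommend item:", int(np.argmax(scores)))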

Sr. Machine Learning/NLP Engineer Oct 2013 – Feb 2015

Sisense, New York City, NY

At Sisense, I developed scalable NLP models for information extraction, topic modeling, and predictive analytics, successfully processing and tokenizing 1.6 million sentences. I implemented advanced NLP techniques (TF-IDF, Word2Vec) and leveraged OLAP cubes for behavioral segmentation, data mining, and predictive modeling. My work in machine learning, statistical analysis, and visualization enhanced data-driven decision-making and product intelligence.

Key Contributions:

Developed and optimized NLP pipelines for tokenization, parsing, and relationship extraction using NLTK, Word2Vec, and TF-IDF (see the TF-IDF sketch after this role's tech stack).

Processed and tokenized 1.6M sentences, improving text classification accuracy and search relevance.

Designed and deployed scalable production NLP models, ensuring efficient model maintenance and versioning.

Implemented OLAP cubes for predictive modeling, enabling customer behavior segmentation and insight extraction.

Developed neural networks and cluster analysis models using R and SAS to identify key data patterns.

Applied ML algorithms (K-Means Clustering, Gaussian Distribution, Decision Trees) for classification and pattern recognition.

Leveraged Python for ML pipelines, implementing regression models, random forests, and ensemble techniques.

Created interactive data visualizations to highlight key insights, supporting business and engineering teams.

Built and optimized machine learning data pipelines, enhancing model scalability and deployment efficiency.

Utilized statistical methods (inferential statistics, bootstrap aggregation) to improve predictive accuracy.

Tech Stack: NLTK, Word2Vec, TF-IDF, Scikit-learn, R, SAS, OLAP, Python, SQL, Spark, K-Means Clustering, Gaussian Distribution, Decision Trees, Regression Models
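
A compact TF-IDF sketch standing in for the 1.6M-sentence pipeline above, using scikit-learn's TfidfVectorizer; the three sentences are placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [
        "Dashboards surface revenue trends quickly.",
        "Revenue forecasts rely on clean data.",
        "Clean data pipelines keep dashboards fresh.",
    ]
    vec = TfidfVectorizer(stop_words="english")     # tokenize + weight in one step
    X = vec.fit_transform(sentences)                # sparse (n_sentences, n_terms)

    terms = vec.get_feature_names_out()
    top = X[0].toarray().ravel().argsort()[::-1][:3]
    print("top terms in sentence 0:", [terms[i] for i in top])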

Data Scientist Feb 2012 – Oct 2013

Ross Stores, Dublin, CA

At Ross Stores, I developed and deployed machine learning models to enhance anomaly detection, document classification, and predictive analytics. Working in an Agile development environment, I utilized Python, R, and SAS to process and visualize large datasets. My key contributions included implementing CNN-based anomaly detection, designing real-time cloud data pipelines, and integrating machine learning solutions on Azure, improving operational efficiency and data-driven decision-making.

Key Contributions:

Developed anomaly detection algorithms using Convolutional Neural Networks (CNNs) to enhance fraud detection and operational efficiency (see the CNN sketch after this role's tech stack).

Designed document classification models using Random Forest, Neural Networks, KNN, K-means clustering, and Logistic Regression for text analytics.

Built predictive analytics models with Decision Trees, SVM, and Random Forest, improving forecasting accuracy.

Engineered scalable ML models in Python (Pandas, NumPy, Scikit-learn) and R, automating complex analytics workflows.

Developed and deployed machine learning solutions on Microsoft Azure, enabling AI-driven insights at scale.

Led a cross-functional Business Intelligence group, architecting near-real-time cloud and traditional data systems for streamlined data processing.

Implemented SQL-based data retrieval and transformation pipelines, integrating search engines for optimized querying.

Conducted in-depth exploratory data analysis (EDA) to interpret noisy datasets and extract actionable insights.

Automated ML model scheduling in SAS, ensuring seamless integration with MSSQL databases.

Tech Stack: CNNs, Random Forest, Decision Trees, SVM, Logistic Regression, Python, R, SAS, SQL, Azure, Pandas, NumPy, Scikit-learn, Matplotlib
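
A toy Keras version of the CNN classification pattern described above; the random 32x32 inputs, binary "anomalous" labels, and tiny architecture are illustrative only.

    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(0)
    X = rng.random((64, 32, 32, 1)).astype("float32")       # stand-in images
    y = rng.integers(0, 2, 64)                              # 1 = anomalous

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=2, verbose=0)
    print(model.predict(X[:1], verbose=0))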

Data Consultant Aug 2011 – Jan 2012

Harbor Freight Tools, Calabasas, CA

Key Contributions:

Analyzed sales data for trends and seasonal patterns to inform inventory management and purchasing decisions.

Developed reports to track KPIs such as sales, profit margins, and customer demographics (see the aggregation sketch after this list).

Collaborated with store managers to understand their data needs and provide tailored insights.

Utilized data visualization tools to create clear and concise reports and presentations.

Implemented data quality checks to ensure data accuracy and consistency.

Assisted in the design and implementation of a new point-of-sale system.

Developed a system to track inventory levels and identify potential stockouts.

Provided training for store managers on how to use data to make informed decisions.
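
A minimal pandas sketch of the KPI aggregation described above, rolling revenue and average margin up by store and month; all column names and figures are hypothetical.

    import pandas as pd

    sales = pd.DataFrame({
        "store":   ["A", "A", "B", "B"],
        "month":   ["2011-09", "2011-10", "2011-09", "2011-10"],
        "revenue": [12000, 13500, 9000, 9800],
        "cost":    [8000, 9000, 6500, 7200],
    })
    sales["margin_pct"] = (sales["revenue"] - sales["cost"]) / sales["revenue"] * 100
    report = sales.groupby(["store", "month"]).agg(
        revenue=("revenue", "sum"),
        avg_margin_pct=("margin_pct", "mean"),
    )
    print(report)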

EDUCATION

Bachelor of Science in Computer Science - Grand Valley State University

Master of Science in Data Science & Analytics - Grand Valley State University


