Senior Data Scientist

Location:

Chicago, IL

Posted:

November 11, 2024

Resume:

Jean Baptiste

Senior AI Data Scientist

Phone: 217-***-****

Email: *******************@*****.***

Profile Snapshot

Dynamic Data Scientist with over 15 years in IT and 12+ years specializing in Data Science, Machine Learning, and Artificial Intelligence, consistently pioneering transformative solutions across industries. Skilled in predictive analytics, including the design of recommender systems, predictive modeling, and forecasting.

•Programming & Machine Learning: Proficient in Python (Pandas, NumPy, TensorFlow, Keras, Scikit-learn, Seaborn, Matplotlib) to build diverse machine learning models, including logistic regression, gradient-boosted decision trees, and neural networks.

•Data Processing: Strong capabilities in data acquisition, validation, and predictive modeling, ensuring model reliability and accuracy.

•Natural Language Processing (NLP): Extensive experience with NLP techniques, such as information extraction, topic modeling, parsing, and relationship extraction, using NLTK and SpaCy.

•NLP Model Deployment: Skilled in developing, scaling, and deploying NLP models for production, optimizing for efficiency and scalability.

•Feature Engineering: Expertise in advanced techniques (PCA, feature normalization, label encoding) using Scikit-learn.

•Model Optimization: Leveraged cross-validation and hyperparameter tuning to avoid overfitting and enhance model performance.

•Distributed Systems: Built Python-based distributed random forests using PySpark and MLlib.

•Advanced Techniques: Applied an array of machine learning methods, including Naïve Bayes, regression analysis (linear, logistic), neural networks (RNN, CNN), transfer learning, time-series analysis, and random forests.

•Interactive Visualizations: Developed interactive visual tools and dashboards in Python (Matplotlib, Plotly, Seaborn) and R (dplyr, tidyverse, Shiny).

•Custom BI Solutions: Created tailored BI reporting dashboards with Dash and Plotly, transforming insights into actionable recommendations.

•Comprehensive Reporting: Generated detailed reports on model performance using Tableau, supporting data-driven decision-making.

•AWS Services: Integrated AWS services (S3, DynamoDB, Lambda, EC2) for secure data storage and seamless model deployment.

•Automated ML Pipelines: Implemented automated machine learning pipelines using MLflow and Python, streamlining model development, deployment, and monitoring, while leveraging AWS Lambda for serverless execution and scalability.

•Business-Data Alignment: Translated business needs into analytical models using Python and TensorFlow, aligning model outcomes with stakeholder expectations.

•Custom BI Development: Built Power BI solutions tailored to meet business specifications, adding value through precise, customized data insights.

•Requirements Gathering: Conducted stakeholder interviews, workshops, and documentation reviews to define processes, identify risks, and ensure alignment.

•SQL & Data Analysis: Proficient in SQL and managing relational databases, applying statistical techniques to enhance analysis accuracy for both supervised and unsupervised learning tasks.

•Pattern Discovery & Insights: Demonstrated capability in uncovering data patterns using algorithms, visual tools, and informed intuition.

•Iterative Improvement: Used experimental, iterative approaches to validate findings, refining models for maximum effectiveness.

Technical Skills

Programming Languages: Analytic programming using Python (NumPy, Pandas, Matplotlib, Scikit-learn, TensorFlow, PyTorch, Keras) and R (tidyverse, ggplot2, dplyr, purrr, tidyr, and more)

Analytic Scripting Languages: Python, R, MATLAB

IDEs: RStudio, PyCharm, Visual Studio, Visual Studio Code, Jupyter Notebook, Sublime Text, Xcode, MATLAB R2021b, Eclipse

Database, Query, Data Cleaning, and Normalization: PostgreSQL, MySQL, SQL Server, RDS, Redshift, MongoDB, DynamoDB, MS Excel, MS Access

Machine Learning Methods: Applying classification, regression, prediction, dimensionality reduction, and clustering to solve challenges in retail, manufacturing, and market science, using methods like Linear Regression, Logistic Regression, Random Forest, XGBoost, KNNs, and Deep Learning in Python.

Deep Learning Methods: Artificial Neural Networks, CNNs, LSTMs, Gradient Descent variants (including Adam, Nadam, Adadelta, RMSProp), Regularization Methods, and Training Acceleration with Momentum Techniques, using TensorFlow, PyTorch, and Keras

Artificial Intelligence: Text understanding, classification, pattern recognition, recommendation systems, targeting systems, ranking systems, and analytics.

Statistical Analysis: A/B Testing, ANOVA, T-Test, Model Selection, Anomaly Detection, Case Diagnostics, and Feature Selection in R or Python for analysis of data.

Analytics: Research, analysis, forecasting, and optimization to improve the quality of user-facing products; Probabilistic Modeling and Approximate Inference; Advanced Data Modeling; Predictive, Statistical, Exploratory, and Stochastic Analysis; Bayesian Analysis, Inference, and Models; Regression Analysis; Linear Models; Multivariate Analysis; Sampling Methods; Segmentation; Clustering; Sentiment Analysis.

Cloud: Extensive use of AWS, GCP, and Azure for model development, deployment, and maintenance.

Professional Experience

Lead AI Engineer

Analytics8, Chicago, Illinois April 2022 - Present

In my role as Lead AI Engineer at Analytics8, I developed a custom algorithm for document ingestion that converts, segments, and prepares documents for embedding into a vector database. Leveraging LangChain and OpenAI's text-embedding-ada-002 model, I optimized the processing of PDF, HTML, and text files. I also designed a chunk provenance scheme to enhance data traceability and created a Context Retrieval Class in Python with Pinecone. Additionally, I applied advanced prompt engineering techniques for large language models (LLMs) and deployed the entire system through a Jenkins and Docker CI/CD pipeline, ensuring scalability, efficiency, and reliable service delivery.

•Engineered a bespoke algorithm for document processing, capable of converting, segmenting, and preparing documents for embedding, followed by efficient upsert into a vector database.

•Utilized LangChain and OpenAI’s text-embedding-ada-002 model to effectively split and embed various document formats, including PDFs, HTML, and text files.

•Created a unique chunk provenance scheme embedded in the Vector DB metadata, enhancing data traceability and manageability.

•Built a Python-based Context Retrieval Class with Pinecone, facilitating streamlined and precise data retrieval.

•Designed advanced prompt engineering techniques to generate tailored system and user prompts for use within large language models (LLMs).

•Developed robust functions that integrate retrieved data into LLM completions, ensuring responses are accurate and contextually relevant.

•Assessed LLM outputs using quality metrics like BLEU scores, perplexity, and diversity to maintain high standards in generated content.

•Conducted in-depth evaluations of Retrieval-Augmented Generation (RAG) to verify relevance and accuracy, ensuring the reliability of generated information.

•Deployed the complete system with Jenkins and Docker, establishing a CI/CD pipeline for seamless and reliable updates.

•Built a scalable microservice using Flask and Gunicorn for efficient service delivery.

•Presented technical findings and outcomes to stakeholders, translating complex concepts into clear, actionable insights.

•Integrated the latest advancements in AI, continually incorporating new techniques and models to keep the system cutting-edge.

•Implemented robust data security measures to protect sensitive information within AI systems, ensuring data privacy.

•Performed comprehensive scalability and performance tuning, enabling the system to handle increased load while maintaining quality output.

•Collaborated closely with data scientists, engineers, and product managers to align AI capabilities with business objectives and user needs.

•Focused on enhancing the user experience by integrating intuitive interfaces and ensuring the seamless functionality of AI-driven features.

Lead Data Scientist

Bristol-Myers Squibb, Lawrence, NJ Feb 2020 – Mar 2022

As Lead Data Scientist at Bristol-Myers Squibb, I spearheaded initiatives to advance user experiences by leveraging innovative technologies across three key projects. We improved search capabilities with advanced NLP methods, fine-tuning BERT and ELMo models and accelerating processing with parallel computing and GPU frameworks. In medical cost prediction, we implemented Machine Learning and Deep Learning algorithms, optimizing routines and enhancing forecast accuracy. Additionally, we built a recommender system that personalized service suggestions based on user interactions and clinical history, using data from clicks, logs, and demographics to deliver tailored recommendations.

•Performed comprehensive analysis of data and statistics on Medicare and Medicaid specialties and procedures.

•Applied a variety of visualization methods, including histograms, pie charts, box-and-whisker plots, and distribution curves, for detailed assessment of variable distributions.

•Utilized tools such as NLTK, Gensim, SpellChecker, Spello, SymSpell, TextBlob, Python's re module, and BERT-based sentence_transformers to normalize user searches effectively.

•Championed the integration of Generative AI techniques to synthesize data, expand training datasets, and diversify data distributions, strengthening machine learning model resilience.

•Leveraged advanced language models like GPT-4 within the LangChain Framework to implement sophisticated decoding strategies and prompt engineering for generating context-rich text.

•Drove AI model development, exploratory data analysis, and modernization efforts for GPT-4 and GPT-4 Vision, fostering innovation and technological growth within the organization.

•Automated deployment pipelines with Azure DevOps, ensuring smooth software application delivery to enhance user experiences.

•Implemented robust security protocols across Azure resources, including Azure Active Directory and Azure Key Vault, to protect healthcare data privacy and uphold compliance standards.

•Hosted bots on Azure OpenAI Studio and engineered scalable, cloud-centric solutions using Azure’s platform to boost the resilience of digital personalization initiatives.

•Used Optuna for model tuning, visualizing KPI outcomes and processing times through Matplotlib and Seaborn.

•Refactored code from Notebooks to Python Classes and Methods, managed version control with Bitbucket, and documented implementation details in SharePoint for stakeholders.

•Ensured search engine accuracy and efficiency through thorough QA and UAT testing, complemented by debugging and troubleshooting.

•Contributed to a dynamic technical environment, working in Python and C/C++ on Linux to drive forward the organization's mission with innovation and excellence.

Sr. Data Scientist

Centene Corporation, St. Louis, MO Aug 2018 – Jan 2020

As a Senior Data Scientist at Centene Corporation, I spearheaded the initiative to enhance their long-term care business data handling system, which consisted of a collection of Excel spreadsheets. Our goal was to use NLP to analyze text across rows and columns and determine the probability of semantic equivalence. We focused on medical terms and codes to identify instances, such as birth deliveries by C-section, that may indicate long-term care needs. By implementing an advanced text analysis framework, we generated reports showing the probability that data entries referred to the same concept. This system prioritized cases with mid-range probabilities (around 50%) for human review, while cases with extreme probabilities were deemed lower priority, streamlining the review process and ensuring accuracy in data interpretation.

•Led a multidisciplinary team at Centene Corporation, including a Data Scientist, Project Manager, and three NLP Specialists, overseeing project planning and promoting effective team collaboration.

•Conducted in-depth research on advanced techniques like Regular Expressions, data cleaning frameworks, code string similarity computations, and clustering methodologies.

•Assessed model optimization impacts on performance metrics using A/B testing.

•Analyzed data from Medicare, Medicaid, and Ambetter sources, using distribution plots such as histograms and pie charts to gain insight into individual variables.

•Utilized an LLM for sophisticated text analysis, automating the identification of potential matches in Centene Corporation’s long-term care data.

•Deployed the LLM framework across Excel spreadsheets to estimate similarity probabilities, flagging entries for review based on defined thresholds.

•Applied Regex for pre-processing drug and procedure codes, performed imputation for missing data, and measured sample similarity scores.

•Streamlined workflows with AutoGen and the LlamaIndex framework to quantify linguistic trends and patterns.

•Delivered insights on clustering and feature selection through unsupervised learning methods, including prototype-based and hierarchical clustering.

•Enhanced code efficiency and scalability by restructuring code from Notebooks to Python Classes and Methods.

•Created R&D mock-ups and documented technical implementations in SharePoint for broad stakeholder access.

•Debugged Python code issues manually, employing both object-oriented and functional programming approaches, using Pandas, NumPy, Matplotlib, Seaborn, and the pyclustering library.

Sr. NLP Engineer

Huntington Bank, Columbus, OH Dec 2015 - July 2018

In my role as a Senior NLP Engineer at Huntington Bank, I leveraged advanced data science techniques to address complex banking challenges. I developed and deployed scalable machine learning models using Python and cloud technologies to improve customer experience and operational efficiency. I conducted in-depth data analysis and visualization to uncover actionable insights and support strategic decision-making. I also implemented natural language processing techniques to extract valuable information from large volumes of text data, enhancing fraud detection and customer service.

•Retrieved and validated sensitive customer data from production SQL databases, ensuring data integrity and compliance with security regulations.

•Processed and cleaned massive datasets of over 10 million text records, applying advanced text preprocessing techniques to prepare data for model training.

•Leveraged AWS cloud resources for efficient model training and hyperparameter tuning, optimizing model performance and reducing computational costs.

•Developed a variety of machine learning models, including logistic regression, random forests, gradient boosting, and neural networks, using Python libraries like Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn to address diverse banking challenges.

•Conducted in-depth statistical analysis using Python and R, employing linear regression and advanced statistical techniques to uncover relationships between variables and inform decision-making.

•Performed comprehensive exploratory data analysis (EDA) using techniques like bag of words, K-means, and DBSCAN clustering to identify patterns, anomalies, and insights within the data.

•Utilized Git and GitHub for version control, enabling collaborative development and efficient tracking of code changes.

•Experimented with various embedding techniques, including Universal Sentence Encoder, Doc2Vec, TF-IDF, BERT, and ELMo, to effectively represent text data for natural language processing tasks.

•Developed predictive models to forecast key risk and performance indicators such as customer churn, fraud, and loan default, enabling proactive risk management and strategic planning.

•Created compelling data visualizations and reports using Tableau, MS Office, and ggplot2 to communicate insights to stakeholders and drive data-driven decision-making.

•Crafted complex SQL queries to extract valuable insights from data warehousing systems, supporting business analysis and reporting needs.

Data Scientist

KPMG, New York City, NY Sep 2012 - Nov 2015

As a Data Scientist at KPMG, I was instrumental in using data-driven insights to refine business strategies. I conducted sentiment analysis on customer feedback, identifying core satisfaction drivers and implementing targeted enhancements. By developing advanced customer segmentation algorithms, I helped reduce marketing costs by 20% and built efficient data pipelines utilizing SQL, Python, and Hadoop. I also ran A/B testing experiments that boosted campaign ROI by 15% and collaborated on the deployment of recommendation systems to improve personalized customer experiences. My contributions led to a 25% increase in customer engagement and delivered strategic, data-backed recommendations for senior leadership.

•Performed sentiment analysis on customer feedback data, pinpointing key satisfaction drivers and implementing targeted improvement initiatives.

•Developed and implemented advanced customer segmentation algorithms, achieving a 20% reduction in marketing costs through optimized budget allocation.

•Designed and constructed robust data pipelines and databases using SQL, Python, and Hadoop technologies, ensuring data integrity and reliability for analysis.

•Conducted A/B testing experiments to enhance conversion rates, resulting in a 15% improvement in campaign ROI.

•Worked collaboratively with cross-functional teams to create and deploy recommendation systems, improving personalized customer experiences and increasing upsell opportunities.

•Led a cross-functional team in conducting customer segmentation analysis, which resulted in targeted marketing campaigns and a 25% increase in customer engagement.

•Defined project goals, collected data requirements, and developed analytical solutions for market research and customer lifetime value analysis.

•Stayed updated on the latest advancements in data science and machine learning technologies to continuously enhance analytical capabilities.

•Employed natural language processing techniques to analyze customer feedback for sentiment analysis and product improvement.

•Conducted market segmentation analysis to identify distinct customer groups, tailoring marketing strategies accordingly.

•Delivered detailed reports and presentations to senior executives, highlighting key insights and providing actionable recommendations based on data analysis.

•Utilized A/B testing methodologies to evaluate the impact of marketing campaigns on customer behavior, offering data-driven recommendations for optimizing future initiatives.

•Collaborated with cross-functional teams to define project objectives, gather data requirements, and develop analytical solutions.

•Developed customer lifetime value estimation models to forecast future revenue potential, informing customer acquisition and retention strategies.

Data Analyst

LatentView Analytics, San Jose, CA April 2008 - Aug 2012

•Revamped database management systems to adapt to the company's evolving needs, promoting agility and innovation.

•Improved customer satisfaction by strategically implementing SQL-driven database tools that streamlined service delivery.

•Leveraged advanced SQL queries and MS Excel reporting to provide actionable business insights, enabling informed decision-making.

•Managed comprehensive project schedules, collaborating with stakeholders to ensure successful product launches.

•Led continuous improvement initiatives in product development and process optimization, enhancing operational efficiency and standardization.

•Guided a multidisciplinary team in executing thorough data cleansing and ETL processes, ensuring data integrity for a new database system.

•Established stringent quality control measures to uphold data consistency and integrity, conducting detailed audits of generated data samples.

•Facilitated smooth transitions from legacy systems to new platforms through proactive inter-departmental coordination and regular progress meetings.

•Oversaw end-user training programs to ensure proficient use of software tools, empowering employees with essential skills.

•Ensured consistent service availability by proactively maintaining the company's online Oracle database infrastructure.

Education

Master of Science of AI in Business Data Analytics

Emory University, Goizueta Business School, Atlanta, GA

PhD Program in Environmental Sciences

University of Texas, Arlington
