Post Job Free

Senior Data Scientist

Location:
Dallas, TX, 75208
Posted:
June 03, 2024

Resume:

Theo Dushime

Sr. Data Scientist/ Machine Learning/ AI

Contact: 360-***-****

Email: ad5q4o@r.postjobfree.com

Summary:

Over 16 years of experience in IT, including 11+ years of expert involvement in Artificial Intelligence, Data Mining, Deep Learning, Predictive Analysis, and Machine Learning with large structured and unstructured datasets. Proficient in managing the entire data science project life cycle and actively involved in all of its phases.

•Presently working with AT&T as a Sr. Data Scientist/AI Engineer with expertise in Data Extraction, Data Modeling, Data Wrangling, Statistical Modelling, Data Mining, Machine Learning, and Data Visualization.

•Academically proficient with an MS in Data Science from DePaul University, Chicago, Illinois, backed by a Postgraduate Diploma in Education and a Bachelor of Science in Applied Mathematics.

•Demonstrated excellence in Natural Language Processing (NLP) methods, in particular BERT, ELMo, word2vec, sentiment analysis, Named Entity Recognition, and Topic Modelling, and in Time Series Analysis with AR, MA, ARIMA, SARIMA, ARCH, GARCH, LSTM, RNN, and Prophet models.

•Well-versed in using various packages in Python and R like Pandas, NumPy, SciPy, Matplotlib, Seaborn, TensorFlow, Scikit-Learn, and ggplot2.

•Broad experience with third-party cloud platforms: AWS, Google Cloud, and Azure.

•Possess a deep understanding of applying Naïve Bayes, Regression, and Classification techniques as well as Neural Networks, Deep Neural Networks, Decision Trees, and Random Forests.

•Experience in working with and querying large datasets from big data stores, including Hadoop data lakes, data warehouses, AWS (Redshift, Aurora), Cassandra, and NoSQL databases.

•Competent in statistical analysis programming languages such as R and Python (including Big Data technologies such as Spark, Hadoop, Hive, HDFS, and MapReduce).

•Proficient in applying statistical analysis and machine learning techniques to live data streams from big data sources using PySpark and batch processing techniques.

•Successful in leading teams to productionize statistical or machine learning models and create APIs or data pipelines for the benefit of business leaders and product managers.

•Good knowledge of creating visualizations, interactive dashboards, reports, and data stories using Tableau and Power BI.

•Familiar with ensemble algorithm techniques, including Bagging, Boosting, and Stacking.

•Hands-on experience with PaLM and OpenAI models, including Davinci, GPT-2, GPT-3, GPT-3.5, and GPT-4.

•Experienced in executing EDA to find patterns in business data and communicating findings to the business using visualization tools such as Matplotlib, Seaborn, and Plotly.

•Experience in tracking defects using bug-tracking and version control tools like Jira and Git.

•Strong experience in interacting with stakeholders/customers, gathering requirements through interviews, workshops, and existing system documentation or procedures, defining business processes, and identifying, and analyzing risks using appropriate templates and analysis tools.

•Skilled in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor and Cluster Analysis, and Discriminant Analysis.

•Experience in Agile Methodology and Scrum Process

•Adept at understanding new subject matter domains and designing and implementing effective, novel solutions for use by other experts.

•Excellent written and oral communication, interpersonal, analytical, and leadership skills; able to explain complex data science concepts in easily digestible terms for stakeholders and clients.

Technical Skills:

Libraries:

NumPy, SciPy, Pandas, Theano, Caffe, Scikit-learn, Matplotlib, Seaborn, Plotly, TensorFlow, Keras, NLTK, PyTorch, Gensim, Urllib, BeautifulSoup4, PySpark, PyMySQL, SQLAlchemy, MongoDB, sqlite3, Flask, Deeplearning4j, EJML, dplyr, ggplot2, reshape2, tidyr, purrr, readr, Apache Spark.

Machine Learning Techniques:

Supervised Machine Learning Algorithms (Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Naïve Bayes Classifiers, K Nearest Neighbors), Unsupervised Machine Learning Algorithms (K-Means Clustering, Gaussian Mixtures, Hidden Markov Models, Auto Encoders), Imbalanced Learning (SMOTE, ADASYN, NearMiss), Deep Learning (Artificial Neural Networks), Machine Perception

Analytics:

Data Analysis, Data Mining, Data Visualization, Statistical Analysis, Multivariate Analysis, Stochastic Optimization, Linear Regression, ANOVA, Hypothesis Testing, Forecasting, ARIMA, Sentiment Analysis, Predictive Analysis, Pattern Recognition, Classification, Behavioral Modeling

Natural Language Processing:

Document Tokenization, Token Embedding, Word Models, Word2Vec, FastText, Bag of Words, TF-IDF, BERT, ELMo, LDA

Programming Languages:

Python, R, SQL, Java, MATLAB, and Mathematica

Applications:

Machine Language Comprehension, Sentiment Analysis, Predictive Maintenance, Demand Forecasting, Fraud Detection, Client Segmentation, Marketing Analysis, Cloud Analytics in cloud-based platforms (AWS, MS Azure, Google Cloud Platform)

Deployment:

Continuous improvement in project processes, workflows, automation, and ongoing learning and achievement

Development:

Git, GitHub, GitLab, Bitbucket, SVN, Mercurial, Trello, PyCharm, IntelliJ, Visual Studio, Sublime, JIRA, TFS, Linux

Big Data and Cloud Tools:

HDFS, Spark, Google Cloud Platform, MS Azure Cloud, SQL, NoSQL, Data Warehouse, Data Lake, HiveQL, AWS (Redshift, Kinesis, EMR, EC2, Lambda)

Professional Experience:

Sr. Data Scientist/ AI – AT&T, Dallas, Texas Feb 2024 – Present

(Company Profile: AT&T is the world's fourth-largest telecommunications company by revenue and the largest wireless carrier in the United States.)

As a Sr. Data Scientist at AT&T, I helped create an AI-driven system to examine cause-and-effect relationships, pinpoint the factors driving lead conversion, and offer practical recommendations for effective interventions. We crafted a user-friendly Streamlit application leveraging CausalNex’s Bayesian Network, enabling us to conduct intervention simulations at both macro and micro levels, adjusting feature distributions and gauging their impact on conversion rates. The analysis surfaced insights projected to boost conversion rates by 5%, delivering substantial value to our business partners.

•Conducted causal analysis on AT&T customer-interaction lead conversion data to identify causal relationships affecting conversion.

•Utilized causal inference methods, such as Bayesian Networks and Do-Calculus, to understand the impact of various features on lead conversion rates.

•Collaborated with a cross-functional team using Kanban methodology, leveraging domain expertise and machine learning to design and optimize a robust causal graph.

•Ran multiple intervention simulations to alter the distributions of different features and measure the resulting impact on business sales. Identified key factors that could potentially enhance lead conversion rates.

•Surfaced key insights for the business through causal analysis, revealing potential improvements that could boost lead conversion rates by 5%.

•Leveraged Snowflake for data querying and Azure Databricks to train Bayesian Network models on virtual clusters.

•Developed an interactive Streamlit application to run intervention simulations, visualize causal relationship graphs, and communicate insights to stakeholders.

•Spearheaded initiatives to create synthetic datasets that accurately capture the statistical properties and correlations present in real-world data while preserving privacy and confidentiality.

•Work closely with stakeholders to ensure the ethical and responsible deployment of generative AI models, addressing concerns related to bias, fairness, and transparency.

•Provide mentorship and guidance to junior data scientists and AI practitioners, fostering a culture of continuous learning and professional development.

Technologies Used: Torch, Detectron2, Pillow, NumPy, OpenCV, ResNet, Matplotlib, Selenium, S3, Amazon SageMaker, HQL, Airflow, Streamlit, VS Code, Snowflake, Azure Databricks, Python, Scikit-Learn, CausalNex, ShowWhy, DoWhy.
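
The macro-level intervention simulation described above can be sketched in plain Python. The channel names and rates below are hypothetical, not AT&T data: the overall conversion rate is an expectation over a feature's distribution, so shifting that distribution re-weights the per-value conversion rates, mirroring a do-style intervention on a discrete Bayesian-network node.

```python
# Hypothetical illustration of a distribution-shift intervention:
# overall conversion rate = sum over feature values v of
#   P(feature = v) * P(convert | feature = v).

def conversion_rate(feature_dist, conv_given_feature):
    """Expected conversion rate under a given feature distribution."""
    return sum(p * conv_given_feature[v] for v, p in feature_dist.items())

# Conditional conversion rates per contact channel (made-up numbers).
conv_given_channel = {"web": 0.02, "store": 0.05, "phone": 0.08}

observed = {"web": 0.7, "store": 0.2, "phone": 0.1}      # current channel mix
intervention = {"web": 0.5, "store": 0.2, "phone": 0.3}  # simulated shift

base = conversion_rate(observed, conv_given_channel)
lifted = conversion_rate(intervention, conv_given_channel)
print(f"baseline {base:.4f} -> intervention {lifted:.4f}")
```

In CausalNex, this kind of shift corresponds to `InferenceEngine.do_intervention` on a node followed by re-querying the conversion node's marginals.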

Sr. Data Scientist & ML Engineer – Google, Mountain View, CA Jun 2023 – Jan 2024

(Company Profile: Google LLC is an American multinational technology company focusing on online advertising, search engine technology, cloud computing, software, and artificial intelligence.)

As a Sr. Data Scientist at Google, I used NLP for sentiment analysis of customer reviews to gauge how pleased customers were with Google Shopping products and the overall shopping experience, contributing to business growth and customer engagement through data-driven insights and predictive modelling. We also analysed customer search queries to identify frequently used keywords and optimized product titles and descriptions to include them, improving product visibility in search results. I successfully developed and deployed NLP models using deep learning techniques, contributing to a 20% increase in customer satisfaction scores and a 15% increase in sales.

•Utilized NLP techniques such as sentiment analysis with BERT, text classification with ULMFiT, named entity recognition with spaCy, and text summarization with GPT-3 to gain insight into customer satisfaction (sentiment analysis score), improve customer experience, and increase sales.

•Analysed customer search queries and product descriptions using NLP and deep learning techniques, such as keyword extraction with TF-IDF, embedding with Word2Vec, topic modelling with LDA, and language translation with OpenNMT, to improve search engine optimization (SEO) and product recommendations, resulting in increased visibility and sales.

•Analysed historical customer data and developed a robust model that accurately predicted churn, enabling us to implement proactive retention strategies.

•Developed customer personas based on various attributes, which allowed us to implement targeted marketing strategies and personalized customer experiences.

•Built customer churn predictive models using Logistic Regression, Random Forest, and XGBoost to identify potential customer churn.

•Deployed models on GCP Vertex AI for easy scalability and accessibility.

•Deployed models on Cloud Elasticsearch for improved search capabilities and Cloud API for natural language processing.

•Analysed purchase quantity, purchase probability, and brand choice based on pricing variations, assisting in pricing strategy decisions.

•Employed Deep Learning techniques using TensorFlow and Keras to predict the probability of customer conversion, optimizing our marketing spend.

•Utilized tools such as TensorFlow, PyTorch, NLTK, and spaCy to implement NLP and deep learning techniques in e-commerce applications.

•Achieved significant improvements in customer satisfaction (sentiment analysis score), sales (product recommendation accuracy), and search engine optimization by applying NLP and deep learning techniques in e-commerce projects and deploying them on GCP (BigQuery, Cloud Storage, Dataflow) for easy scalability and accessibility.

•Developed a SARIMAX model to forecast sales of multiple stores and shared a dashboard of the forecasts with store managers to aid in appropriate inventory restocking.

•Extracted valuable insights from big data with data mining techniques, using Spark’s Python API (PySpark) and ETL pipelines.

•Containerized the developed models in Docker containers and deployed them on a Kubernetes cluster.

•Applied Feature Engineering for dimensionality reduction and to improve the models’ performance.

•Conducted Market Segmentation using cluster analysis (Hierarchical Clustering and K-Means) and dimensionality reduction techniques (PCA) to effectively segment the market.

•Created engaging data visualizations and live interactive dashboards using Tableau, Power BI, and Plotly.

•Collaborated with stakeholders by effectively communicating project progress and results and presenting data-driven recommendations.
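
The keyword-extraction step above (TF-IDF over search queries) can be sketched without external libraries; the queries and rankings below are illustrative only:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank each document's terms by TF-IDF; return the top terms per doc."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of docs containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        # tf-idf = (term count / doc length) * log(N / document frequency);
        # terms appearing in every document score zero.
        scores = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        results.append([t for t, _ in ranked[:top_k]])
    return results

queries = [
    "wireless noise cancelling headphones",
    "wireless phone charger fast charging",
    "noise cancelling earbuds wireless",
]
print(tfidf_keywords(queries, top_k=2))
```

A production pipeline would more likely use scikit-learn's `TfidfVectorizer`, which adds n-grams, smoothing, and normalization.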

Generative AI/ ML Engineer – CareSource, Dayton, Ohio May 2021 – Jun 2023

(Company Profile: CareSource is a nonprofit that began as a managed health care plan serving Medicaid members in Ohio.)

As Generative AI/ ML Engineer, I developed and improved machine learning models to integrate OCR technology into the healthcare provider's existing workflow, resulting in a more streamlined and efficient process for data entry and claims processing. I used OCR technology to extract text from various types of healthcare documents, such as medical records, prescriptions, lab reports, and insurance claims to automate data entry and streamline claims processing.

•Applied an OCR-based system to automatically extract patient information from medical documents, simplifying the data entry process and improving patient care.

•Implemented connectionist temporal classification (CTC) loss function to improve the performance of RNN-based OCR models for recognizing text in variable-length documents.

•Utilized OCR and NLP models to automatically extract codes and information required for claims processing, reducing the need for manual intervention and increasing the speed and accuracy of claims processing.

•Developed and implemented advanced OCR models using deep learning techniques like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to extract data from diverse healthcare forms, including prescriptions, lab reports, and patient information forms.

•Deployed OCR and NLP models on Google Kubernetes Engine (GKE) and Google Vertex AI for scalable and efficient processing of healthcare documents.

•Utilized transfer learning techniques to fine-tune pre-trained CNN and RNN models on a healthcare-specific dataset, significantly reducing training time and increasing accuracy.

•Developed OCR models that can extract data from different healthcare forms, such as prescriptions, lab reports, and patient information forms, reducing manual data entry by 75%.

•Utilized OCR technology to extract text from insurance claims, reducing the number of denied claims by 20% and increasing the efficiency of claims processing.

•Implemented a hybrid OCR system that combines traditional rule-based OCR with deep learning-based OCR for improved accuracy and flexibility in handling different types of documents.

•Created ensembles of CNN and RNN models for text recognition in images, significantly increasing the accuracy and robustness of the OCR system.

•Utilized natural language processing (NLP) techniques such as named entity recognition (NER) and text classification to extract structured data from unstructured text, improving the accuracy of the OCR system.

•Utilized Vertex AI (AML) for model deployment, monitoring, and management of the OCR and NLP models in a production environment.

•Implemented Google Cloud Functions to trigger the OCR and NLP models for real-time processing of incoming healthcare documents.

•Set up Azure Log Analytics for monitoring and troubleshooting the OCR and NLP models in production.

•Used Azure Cognitive Services such as Computer Vision and Text Analytics to enhance the OCR and NLP model’s capabilities.

•Proven track record of staying abreast of the latest advancements in AutoML methodologies and Azure Cognitive Services, actively integrating new techniques and technologies to enhance model performance and stay at the forefront of industry best practices.

•Developed custom NLP models for extracting specific information from medical documents such as patient demographics, diagnosis, and treatment information.

•Utilized Azure Blob Storage for storing and managing large amounts of healthcare documents and the OCR output.

•Led research and development efforts in implementing state-of-the-art GAN architectures tailored for generating synthetic data to augment existing datasets.

•Designed and implemented NLG systems to generate human-like text for applications such as customer service chatbots, automated report generation, and personalized messaging.

•Researched and implemented reinforcement learning algorithms to optimize dynamic decision-making processes within CareSource operational frameworks, such as network resource allocation and customer service management.

•Led projects focused on developing advanced image generation models, including Variational Autoencoders (VAEs) and Conditional GANs, for applications such as synthetic image generation, image inpainting, and style transfer.
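
The CTC decoding rule underlying the RNN-based OCR models above can be illustrated in a few lines: a frame-level output path is collapsed by merging repeated symbols and then dropping blanks, which is the many-to-one mapping that the CTC loss sums over. The path strings here are invented for illustration.

```python
def ctc_collapse(path, blank="-"):
    """Greedy CTC decoding: merge repeated symbols, then drop blanks.

    This is the many-to-one mapping B in the CTC formulation: the CTC loss
    sums the probabilities of every frame-level path that collapses to the
    target transcript, which is how variable-length documents are handled.
    """
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Frame-level argmax path from a (hypothetical) RNN reading the word "lab":
print(ctc_collapse("ll-aa--bb-"))  # prints "lab"
```

Note that a blank between identical characters keeps them distinct, e.g. the double "l" in "hello" requires a path like `"ll-l"`.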

Data Scientist - Citigroup, New York, NY May 2019 – Apr 2021

(Company Profile: Citigroup Inc. is an American multinational investment bank and financial services corporation headquartered in New York City.)

I worked with their Cyber Security and Fraud departments to improve their credit/debit card fraud detection using Machine Learning Algorithms. My primary function was to design a Data-Driven model that detected all cases of fraud while limiting false positives. Statistical Methods were used and as the datasets involved were highly imbalanced, innovative solutions were implemented to ensure the highest confidence in the predictions and outlier detection.

•Performed Data Cleaning, features scaling, and features engineering using pandas and NumPy packages in Python and built models using deep learning frameworks.

•Used Scikit-Learn’s model selection framework to perform hyper-parameter tuning with the GridSearchCV and RandomizedSearchCV algorithms.

•Stratified imbalanced data to ensure fair representation of the minority data in all data sets used for cross-validation of the model.

•Used R’s dplyr for data manipulation, as well as ggplot2 for data visualization and EDA.

•Extracted data from Hive databases on Hadoop using Spark through PySpark.

•Utilized Scikit-Learn, SciPy, Matplotlib, and Plotly for EDA and data visualization.

•Consulted with regulatory and subject matter experts to gain a clear understanding of information and variables within data streams.

•Built Artificial Neural Network models to detect anomalies and fraud in transaction data.

•Performed data mining and developed statistical models using Python to provide tactical recommendations to business executives.

•Proficient in utilizing AWS SageMaker for end-to-end machine learning model development, from data preprocessing and feature engineering to model training, deployment, and monitoring.

•Demonstrated experience in optimizing and fine-tuning machine learning models on SageMaker, ensuring efficient resource utilization and cost-effectiveness.

•Extensive hands-on experience in designing, implementing, and managing data workflows using Apache Airflow, facilitating seamless orchestration of tasks and dependencies.

•Adept at creating and maintaining DAGs (Directed Acyclic Graphs) in Airflow to automate complex data pipelines, enhancing overall efficiency and reliability in data processing and model training workflows

•Programmed a utility in Python that used multiple packages (SciPy, NumPy, pandas).

•Deployed model using a Flask API stored through a Docker container.

•Evaluated model performance using a confusion matrix and accuracy, recall, precision, and F1 scores, paying particular attention to recall.

•Utilized Git for version control on GitHub to collaborate and work with the team members.

•Designed dashboards with Tableau and provided complex reports, including summaries, charts, and graphs to interpret findings to team and stakeholders.

•Worked with Data Engineers for database design for Data Science.

•Developed unsupervised K-Means and Gaussian Mixture Models (GMM) from scratch in NumPy to detect anomalies.

•Employed a heterogeneous stacked ensemble of methods for the final decision on what transaction was fraudulent.

•Used Git for version control, tracking changes in files and coordinating work on them among multiple team members.

•Implemented a Python-based distributed random forest.

•Used predictive modeling with tools in SAS, SPSS, R, and Python.
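
The evaluation step above reduces to positive-class metrics computed from confusion-matrix counts; a minimal sketch with toy labels (1 = fraud, all values hypothetical):

```python
def fraud_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # missed fraud is costly
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels. In fraud detection recall is weighted heavily, since a false
# negative (missed fraud) usually costs more than a false positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = fraud_metrics(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In practice these would typically come from `sklearn.metrics` (`precision_score`, `recall_score`, `f1_score`); the hand computation just makes the trade-off explicit.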

Data Scientist – The Coca-Cola Company, Atlanta, GA Mar 2017 - Apr 2019

(Company Profile: The Coca-Cola Company is a total beverage company with products sold in more than 200 countries and territories.)

At Coca-Cola, I led a team of data scientists from around the world to develop demand forecasting and sales prediction models. In this capacity, I worked to educate leadership on the role of data science within an organization and connected with stakeholders to ensure that any data science work we performed was both relevant and valuable to the company.

•Successfully developed a demand forecasting model based on IRI syndicated data to assist demand planners with effectively allocating resources.

•Developed a POC based on said dataset using NLP and worked with LDA (Latent Dirichlet Allocation) and ELMo (bidirectional LSTM) to extract and cluster key topics.

•Utilized machine-learning models to implement a high-performing demand forecasting framework from scratch.

•Despite an extremely small sample size, the model obtained consistent, high-quality results using hierarchical modeling (MLlib/GBT).

•Worked with stakeholders extensively to provide updates, and tailor model characteristics to better assist end-users.

•Advised on how best to modify existing predictive out-of-stock models to accurately forecast for a longer time horizon.

•Combined several disparate data sources, at different granularities (sales data, economic data, SNAP spending, and more) into one master dataset.

•Worked with PySpark and Python on Azure Databricks.

•The model was a significant improvement over the baseline univariate time-series forecasts used previously.

•Created dataset leveraging numerous, disparate sources (outside data from customers, internal order data, weather data, and transportation data)

•Used PCA to deal with the very high number (hundreds) of sparse, categorical variables.

•Developed POC which leveraged NLP techniques and clustering algorithms (K-Means) to attribute causes to negative reviews.

•Successfully implemented AutoML solutions to optimize hyperparameters, feature engineering, and model selection, contributing to enhanced model performance and accuracy.

•Used web scraping to create a dataset of Amazon reviews from scratch.

•Worked with a variety of Python packages including NumPy, NLTK, Pandas, Torch/PyTorch, and Regex.

•Built a model to determine orders that are at risk of being delivered late. This would allow individuals along the supply chain to intervene and avoid the associated fees.
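
The K-Means clustering used in the review POC above can be sketched in one dimension (hypothetical sentiment scores; the actual POC clustered NLP feature vectors):

```python
def kmeans_1d(points, k=2, iters=25):
    """Toy k-means on 1-D scores (e.g. review sentiment), pure Python."""
    # Deterministic init: spread initial centers across the sorted range.
    pts = sorted(points)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical sentiment scores: negative reviews cluster near 0.1,
# positive reviews near 0.9.
scores = [0.05, 0.1, 0.15, 0.85, 0.9, 0.95]
centers, clusters = kmeans_1d(scores, k=2)
print(centers)
```

The same assignment/update loop generalizes to high-dimensional embeddings by swapping the absolute difference for Euclidean distance.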

ML-Ops Engineer – Nextera Energy, Juno Beach, Florida Jan 2014 - Feb 2017

(Company Profile: NextEra Energy, Inc. is an American energy company that is the largest electric utility holding company by market capitalization. Its subsidiaries include Florida Power & Light (FPL), NextEra Energy Resources (NEER), NextEra Energy Partners, Gulf Power Company, and NextEra Energy Services.)

As an ML-Ops Engineer in the Energy Markets division, my responsibilities were to examine, gather, and deploy energy industry competitive intelligence using industry publications, databases, and other sources. I analyzed energy markets and evaluated the economics of specific projects, developed recommendations for new and improved research tools, and assisted with the development and maintenance of proprietary in-house forecasts, structural databases, and models of regional energy markets, including quantifying ranges on potential outcomes. I performed qualitative and quantitative analysis and quality control on large amounts of data. The principal goal was effective near- and short-term electrical energy demand modeling and optimal supply-mix modeling.

•Effectively built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to model the uncertainty of NextEra’s other time series, ensuring a ‘safety’ stock of generating units.

•Validated models using a train-validate-test split to ensure forecasting was sufficient to elevate the optimal output of the number of generation facilities to meet the system load.

•Undertook multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing and ARIMA.

•Performed feature engineering with the use of NumPy, Pandas, and Feature Tools to engineer time-series features.

•Liaised with facility engineers to understand the problem and ensure our predictions were beneficial.

•Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.

•Prevented over-fitting with the use of a validation set while training.

•Built a meta-model to ensemble the predictions of several different models.

•Actively participated in daily standups, working in an Agile Kanban environment.

•Queried Hive by utilizing Spark using Python’s PySpark Library.

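
The exponential-smoothing approach to day-ahead demand mentioned above can be sketched as follows; the load values are illustrative, not NextEra data:

```python
def exp_smooth_forecast(series, alpha=0.5):
    """Simple exponential smoothing: level = alpha*y_t + (1-alpha)*level.

    The one-step-ahead (day-ahead) forecast is the final smoothed level,
    so recent observations carry exponentially more weight than old ones.
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical daily peak load (MW) over the last five days.
load = [100.0, 104.0, 102.0, 110.0, 108.0]
print(round(exp_smooth_forecast(load, alpha=0.5), 2))  # prints 107.0
```

A larger `alpha` tracks recent swings more aggressively; ARIMA extends this by also modeling trend and autocorrelation structure.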

Sr. Data Analyst – Carnival Cruise Lines, Long Beach, CA Jan 2009 – Dec 2013

(Company Profile: Carnival is one of ten cruise lines owned by the world's largest cruise ship operator, the American-British Carnival Corporation & plc.)

•Enhanced data collection procedures to include information that is relevant for building analytic systems.

•Developed regression analysis using Excel Toolkit to forecast demand for cruise ships.

•Developed dashboards for use by executives for ad hoc reporting using Tableau visualization tools.

•Helped balance load across ports by forecasting demand.

•Generalized feature extraction in the machine learning pipeline which improved efficiency throughout the system.

•Performed univariate, bivariate, and multivariate analyses and thereby created new features and tested their importance.

•Incorporated data mined and scraped from outside sources.

•Processed, cleansed, and verified the integrity of data used for analysis.

•Solved analytical problems, and effectively communicated methodologies and results.

•Conducted ad-hoc analysis and presentation of results.

•Worked closely with internal stakeholders such as business teams, product managers, engineering teams, and partner teams.

Education

MS Data Science

DePaul University, Chicago, Illinois

Postgraduate Diploma in Education

University of Rwanda, Kigali, Rwanda

Bachelor of Science in Applied Mathematics

Kigali Institute of Science and Technology, Kigali, Rwanda


