Senior Data Scientist

Location:
Seattle, WA, 98199
Salary:
95
Posted:
February 11, 2024

Resume:

Theo Dushime

Sr. Data Scientist

Contact: 360-***-****

Email: ad2zzs@r.postjobfree.com

Summary:

Overall 12+ years of experience in IT along with 8+ years of expert involvement in Artificial Intelligence, Data Mining, Deep Learning, Predictive Analysis, and Machine Learning with big datasets of structured and unstructured data; Proficient in managing entire data science project life cycle and actively involved in all the phases of the project life cycle.

Presently working with Google as a Sr. Data Scientist & ML Engineer with expertise in Data Extraction, Data Modeling, Data Wrangling, Statistical Modelling, Data Mining, Machine Learning, and Data Visualization.

Academically proficient with MS Data Science from DePaul University, Chicago, Illinois, backed by a Postgraduate Diploma in Education and a Bachelor of Science in Applied Mathematics.

Demonstrated excellence in Natural Language Processing (NLP) methods, in particular BERT, ELMo, word2vec, sentiment analysis, Named Entity Recognition, and Topic Modelling, and in Time Series Analysis with AR, MA, ARIMA, SARIMA, ARCH, GARCH, LSTM, RNN, and Prophet models.

Well-versed in using various packages in Python and R like Pandas, NumPy, SciPy, Matplotlib, Seaborn, TensorFlow, Scikit-Learn, and ggplot2.

Broad experience in 3rd-party cloud resources: AWS, Google Cloud, and Azure

Possess a deep understanding of applying Naïve Bayes, Regression, and Classification techniques as well as Neural Networks, Deep Neural Networks, Decision Trees, and Random Forests.

Experience in working with and querying large data sets from big data stores using Hadoop Data Lakes, Data Warehouse, Amazon AWS, Cassandra, Redshift, Aurora, and NoSQL

Competent in statistical programming languages such as R and Python, and in Big Data technologies such as Spark, Hadoop, Hive, HDFS, and MapReduce.

Proficient in applying statistical analysis and machine learning techniques to live data streams from big data sources using PySpark and batch processing techniques.

Successful in leading teams to productionize statistical or machine learning models and create APIs or data pipelines for the benefit of business leaders and product managers.

Good knowledge of creating visualizations, interactive dashboards, reports, and data stories using Tableau and Power BI.

Familiar with ensemble learning techniques, including Bagging, Boosting, and Stacking.

Hands-on experience with PaLM and OpenAI models, including Davinci, GPT-2, GPT-3, GPT-3.5, and GPT-4.

Executing EDA to find patterns in business data and communicate findings to the business using visualization tools such as Matplotlib, Seaborn, and Plotly.

Experience in tracking defects and managing code using bug-tracking and version control tools such as Jira and Git.

Strong experience in interacting with stakeholders and customers, gathering requirements through interviews, workshops, and existing system documentation or procedures, defining business processes, and identifying and analyzing risks using appropriate templates and analysis tools.

Skilled in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor and Cluster Analysis, and Discriminant Analysis.

Experience in Agile Methodology and Scrum Process

Excellent in understanding new subject matter domains and designing and implementing effective novel solutions to be used by other experts.

Excellent written and oral communication, interpersonal, analytical, and leadership skills, with the ability to explain complex Data Science concepts in easily digestible terms for stakeholders and clients.

Technical Skills:

Libraries:

NumPy, SciPy, Pandas, Theano, Caffe, scikit-learn, Matplotlib, Seaborn, Plotly, TensorFlow, Keras, NLTK, PyTorch, Gensim, urllib, BeautifulSoup4, PySpark, PyMySQL, SQLAlchemy, MongoDB, sqlite3, Flask, Deeplearning4j, EJML, dplyr, ggplot2, reshape2, tidyr, purrr, readr, Apache Spark.

Machine Learning Techniques:

Supervised Machine Learning Algorithms (Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Naïve Bayes Classifiers, K Nearest Neighbors), Unsupervised Machine Learning Algorithms (K Means Clustering, Gaussian Mixtures, Hidden Markov Models, Auto Encoders), Imbalanced Learning (SMOTE, AdaSyn, NearMiss), Deep Learning Artificial Neural Networks, Machine Perception

Analytics:

Data Analysis, Data Mining, Data Visualization, Statistical Analysis, Multivariate Analysis, Stochastic Optimization, Linear Regression, ANOVA, Hypothesis Testing, Forecasting, ARIMA, Sentiment Analysis, Predictive Analysis, Pattern Recognition, Classification, Behavioral Modeling

Natural Language Processing:

Document Tokenization, Token Embedding, Word Models, Word2Vec, FastText, Bag of Words, TF-IDF, BERT, ELMo, LDA

Programming Languages:

Python, R, SQL, Java, MATLAB, and Mathematica

Applications:

Machine Language Comprehension, Sentiment Analysis, Predictive Maintenance, Demand Forecasting, Fraud Detection, Client Segmentation, Marketing Analysis, Cloud Analytics in cloud-based platforms (AWS, MS Azure, Google Cloud Platform)

Deployment:

Continuous improvement in project processes, workflows, automation, and ongoing learning and achievement

Development:

Git, GitHub, GitLab, Bitbucket, SVN, Mercurial, Trello, PyCharm, IntelliJ, Visual Studio, Sublime, JIRA, TFS, Linux

Big Data and Cloud Tools:

HDFS, Spark, Google Cloud Platform, MS Azure Cloud, SQL, NoSQL, Data Warehouse, Data Lake, HiveQL, AWS (Redshift, Kinesis, EMR, EC2, Lambda)

Professional Experience:

Sr. Data Scientist & ML Engineer – Google, Mountain View, CA Jun 2023 – Now

(Company Profile: Google Inc. is an American multinational technology company focusing on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence.)

As a Sr. Data Scientist at Google, I used NLP for sentiment analysis of customer reviews to gain insight into how pleased customers are with the products of Google Shopping and the overall shopping experience thereby contributing to driving the business growth and enhancing customer engagement through data-driven insights and predictive modelling. We also analysed customer search queries to identify frequently used keywords in those queries and optimized product titles and descriptions to include those keywords to improve the product’s visibility in search results. I successfully developed and deployed NLP models using deep learning techniques resulting in a 20% increase in customer satisfaction scores and a 15% increase in sales.

Utilized NLP techniques such as sentiment analysis with BERT, text classification with ULMFiT, named entity recognition with spaCy, and text summarization with GPT-3 to gain insight into customer satisfaction (sentiment analysis score), improve customer experience, and increase sales.

Analysed customer search queries and product descriptions using NLP and deep learning techniques, such as keyword extraction with TF-IDF, embedding with Word2Vec, topic modelling with LDA, and language translation with OpenNMT, to improve search engine optimization (SEO) and product recommendations, resulting in increased visibility and sales.

Analysed historical customer data and developed a robust model that accurately predicted churn, enabling us to implement proactive retention strategies.

Developed customer personas based on various attributes, which allowed us to implement targeted marketing strategies and personalized customer experiences.

Built customer churn predictive models using Logistic Regression, Random Forest, and XGBoost to identify customers at risk of churning.

Deployed models on GCP Vertex AI for easy scalability and accessibility.

Deployed models on Cloud Elasticsearch for improved search capabilities and Cloud API for natural language processing.

Analysed purchase quantity, purchase probability, and brand choice based on pricing variations, assisting in pricing strategy decisions.

Employed Deep Learning techniques using TensorFlow and Keras to predict the probability of customer conversion, optimizing our marketing spend.

Utilized tools such as TensorFlow, PyTorch, NLTK, and spaCy to implement NLP and deep learning techniques in e-commerce applications.

Achieved significant improvements in customer satisfaction (sentiment analysis score), sales (product recommendation accuracy), and search engine optimization (SEO improvement) by utilizing NLP and deep learning techniques in e-commerce projects and deploying them on GCP for easy scalability and accessibility, using GCP services like BigQuery, Cloud Storage, Dataflow.

Developed a SARIMAX model to forecast sales of multiple stores and shared a dashboard of the forecasts with store managers to aid in appropriate inventory restocking.

Extracted valuable insights from big data by applying data mining techniques with Spark’s Python API (PySpark) and ETL pipelines.

Containerized the developed models in Docker containers and deployed them on a Kubernetes cluster.

Applied Feature Engineering for dimensionality reduction and to improve the models’ performance.

Conducted Market Segmentation using cluster analysis (Hierarchical Clustering and K-Means) and dimensionality reduction techniques (PCA) to effectively segment the market.

Created engaging data visualizations and live interactive dashboards using Tableau, Power BI, and Plotly.

Collaborated with stakeholders by effectively communicating project progress and results and presenting data-driven recommendations.

Sr. Data Scientist (AI/ML Engineering) – UnitedHealth Group, New York, NY May 2021 – Jun 2023

(Company Profile: UnitedHealth Group is a diversified healthcare company that operates through four segments: UnitedHealthcare, Optum Health, Optum Insight, and Optum Rx.)

As Sr. Data Scientist I developed and improved machine learning models to integrate OCR technology into the healthcare provider's existing workflow, resulting in a more streamlined and efficient process for data entry and claims processing. I used OCR technology to extract text from various types of healthcare documents, such as medical records, prescriptions, lab reports, and insurance claims to automate data entry and streamline claims processing.

Applied an OCR-based system to automatically extract patient information from medical documents, which simplified the data entry process and improved patient care.

Implemented connectionist temporal classification (CTC) loss function to improve the performance of RNN-based OCR models for recognizing text in variable-length documents.

Utilized OCR and NLP models to automatically extract codes and information required for claims processing, reducing the need for manual intervention, and increasing the speed and accuracy of the claims processing process.

Developed and implemented advanced OCR models using deep learning techniques like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to extract data from diverse healthcare forms, including prescriptions, lab reports, and patient information forms.

Deployed OCR and NLP models on Google Kubernetes Engine (GKE) and Google Vertex AI for scalable and efficient processing of healthcare documents.

Utilized transfer learning techniques to fine-tune pre-trained CNN and RNN models on a healthcare-specific dataset, resulting in a significant reduction in training time and an increase in accuracy.

Developed OCR models that can extract data from different healthcare forms, such as prescriptions, lab reports, and patient information forms, reducing manual data entry by 75%.

Utilized OCR technology to extract text from insurance claims, reducing the number of denied claims by 20% and increasing the efficiency of the claims processing process.

Implemented a hybrid OCR system that combines traditional rule-based OCR with deep learning-based OCR for improved accuracy and flexibility in handling different types of documents.

Created ensembles of CNN and RNN models for text recognition in images, resulting in a significant increase in the accuracy and robustness of the OCR system.

Utilized natural language processing (NLP) techniques such as named entity recognition (NER) and text classification to extract structured data from unstructured text, improving the accuracy of the OCR system.

Utilized Vertex AI (AML) for model deployment, monitoring, and management of the OCR and NLP models in a production environment.

Implemented Google Cloud Functions to trigger the OCR and NLP models for real-time processing of incoming healthcare documents.

Set up Azure Log Analytics for monitoring and troubleshooting the OCR and NLP models in production.

Used Azure Cognitive Services such as Computer Vision and Text Analytics to enhance the OCR and NLP model’s capabilities.

Utilized tools such as TensorFlow, PyTorch, NLTK, and spaCy to implement NLP and deep learning techniques in healthcare applications.

Developed custom NLP models for extracting specific information from medical documents such as patient demographics, diagnosis, and treatment information.

Utilized Azure Blob Storage for storing and managing large amounts of healthcare documents and the OCR output.

Data Scientist - Citigroup, New York, NY May 2019 – Apr 2021

(Company Profile: Citigroup Inc. is an American multinational investment bank and financial services corporation headquartered in New York City.)

I worked with their Cyber Security and Fraud departments to improve their credit/debit card fraud detection using Machine Learning Algorithms. My primary function was to design a Data-Driven model that detected all cases of fraud while limiting false positives. Statistical Methods were used and as the datasets involved were highly imbalanced, innovative solutions were implemented to ensure the highest confidence in the predictions and outlier detection.

Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python and built models using deep learning frameworks.

Used scikit-learn’s model selection framework to perform hyper-parameter tuning with GridSearchCV and RandomizedSearchCV.

Stratified imbalanced data to ensure fair representation of the minority data in all data sets used for cross-validation of the model.

Used R’s dplyr for data manipulation, as well as ggplot2 for data visualization and EDA.

Extracted data from Hive databases on Hadoop using Spark through PySpark.

Utilized Scikit-Learn, SciPy, Matplotlib, and Plotly for EDA and data visualization.

Consulted with regulatory and subject matter experts to gain a clear understanding of information and variables within data streams.

Built Artificial Neural Network models to detect anomalies and fraud in transaction data.

Performed data mining and developed statistical models using Python to provide tactical recommendations to business executives.

Programmed a utility in Python that used multiple packages (SciPy, NumPy, pandas).

Deployed the model as a Flask API packaged in a Docker container.

Evaluated the performance of our model using a confusion matrix, accuracy, recall, precision, and F1 score, paying particular attention to recall.

Utilized Git for version control on GitHub to collaborate and work with the team members.

Designed dashboards with Tableau and provided complex reports, including summaries, charts, and graphs to interpret findings to team and stakeholders.

Worked with Data Engineers for database design for Data Science.

Developed unsupervised K-Means and Gaussian Mixture Models (GMM) from scratch in NumPy to detect anomalies.

Employed a heterogeneous stacked ensemble of methods for the final decision on what transaction was fraudulent.

Tracked changes in files with Git and coordinated work on those files among multiple team members.

Implemented a Python-based distributed random forest.

Used predictive modeling with tools in SAS, SPSS, R, and Python.

Data Scientist – PepsiCo Inc., Harrison, NY Mar 2017 - Apr 2019

(Company Profile: PepsiCo Inc. is an American multinational food, snack, and beverage corporation headquartered in Harrison, New York.)

At PepsiCo, I led a team of data scientists from around the world to develop demand forecasting and sales prediction models. In this capacity, I worked to educate leadership on the role of data science within an organization and connected with stakeholders to ensure that any data science work we performed was both relevant and valuable to the company.

Successfully developed a demand forecasting model based on IRI syndicated data to assist demand planners with effectively allocating resources.

Developed a POC based on said dataset using NLP, and worked with LDA (Latent Dirichlet Allocation) and ELMo (bidirectional LSTM) to extract and cluster key topics.

Utilized machine-learning models to implement a high-performing demand forecasting framework from scratch.

Despite an extremely small sample size, the model obtained consistent, high-quality results using hierarchical modeling (MLlib/GBT).

Worked with stakeholders extensively to provide updates, and tailor model characteristics to better assist end-users.

Advised on how best to modify existing predictive out-of-stock models to accurately forecast for a longer time horizon.

Combined several disparate data sources, at different granularities (sales data, economic data, SNAP spending, and more) into one master dataset.

Worked in Python and PySpark on Azure Databricks.

The model is a significant improvement over the baseline, univariate time-series forecasts that were used previously.

Created dataset leveraging numerous, disparate sources (outside data from customers, internal order data, weather data, and transportation data)

Used PCA to deal with the very high (hundreds) number of sparse, categorical variables.

Developed POC which leveraged NLP techniques and clustering algorithms (K-Means) to attribute causes to negative reviews.

Used web scraping to create a dataset of Amazon reviews from scratch.

Worked with a variety of Python packages including NumPy, NLTK, Pandas, Torch/PyTorch, and Regex.

Built a model to determine orders that are at risk of being delivered late. This would allow individuals along the supply chain to intervene and avoid the associated fees.

ML-Ops Engineer – NextEra Energy, Juno Beach, Florida Jan 2015 - Feb 2017

(Company Profile: NextEra Energy, Inc. is an American energy company that is the largest electric utility holding company by market capitalization. Its subsidiaries include Florida Power & Light (FPL), NextEra Energy Resources (NEER), NextEra Energy Partners, Gulf Power Company, and NextEra Energy Services.)

As a Data Scientist in the Energy Markets division, my responsibilities were to examine, gather, and deploy energy industry competitive intelligence information using industry-related publications, databases, and other sources. I analyzed energy markets and evaluated the economics of specific projects. Developed recommendations for new or improvements to existing research tools. Assisted with the development and maintenance of proprietary in-house and other forecasts, structural databases, and models of regional energy markets, including quantifying ranges on potential outcomes. Performed qualitative and quantitative analysis and quality control on large amounts of data. The principal goal was to generate effective near and short-term electrical energy demand modeling and optimum supply mixture modeling.

Built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to model the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.

Validated models using a train-validate-test split to ensure forecasts were accurate enough to determine the optimal number of generation facilities needed to meet the system load.

Undertook multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing and ARIMA.

Performed feature engineering with NumPy, Pandas, and Featuretools to engineer time-series features.

Liaised with facility engineers to understand the problem and ensure our predictions were beneficial.

Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.

Prevented over-fitting with the use of a validation set while training.

Built a meta-model to ensemble the predictions of several different models.

Actively participated in daily standups working under an Agile Kanban environment.

Queried Hive by utilizing Spark using Python’s PySpark Library.

Sr. Data Analyst – Royal Caribbean, Miami, Florida Jan 2011 – Dec 2014

(Company Profile: Royal Caribbean International (RCI), headquartered in Miami, Florida, is the largest cruise line by revenue and the second-largest by passenger count.)

Enhanced data collection procedures to include information that is relevant for building analytic systems.

Developed regression analysis using Excel Toolkit to forecast demand for cruise ships.

Developed dashboards for use by executives for ad hoc reporting using Tableau visualization tools.

Helped balance the load across ports by forecasting demand.

Generalized feature extraction in the machine learning pipeline which improved efficiency throughout the system.

Performed univariate, bivariate, and multivariate analyses and thereby created new features and tested their importance.

Incorporated data mined and scraped from outside sources.

Processed, cleansed, and verified the integrity of data used for analysis.

Solved analytical problems, and effectively communicated methodologies and results.

Conducted ad-hoc analysis and presentation of results.

Worked closely with internal stakeholders such as business teams, product managers, engineering teams, and partner teams.

Education

MS Data Science

DePaul University, Chicago, Illinois

Postgraduate Diploma in Education

University of Rwanda, Kigali, Rwanda

Bachelor of Science in Applied Mathematics

Kigali Institute of Science and Technology, Kigali, Rwanda


