
Data Scientist Machine Learning

Location:
Uptown, ON, M4S 2G7, Canada
Posted:
March 11, 2024


Resume:

NYASHA MAKUTO

+1-437-***-**** Toronto, ON, Canada

Email: *********@*****.*** LinkedIn: linkedin.com/in/nyasha-makuto/ Portfolio (GitHub): github.com/Namakuto

SUMMARY

Data Scientist experienced in inferential statistics seeking to apply machine learning methods in a role that specializes in statistical methods such as hypothesis testing, as well as predictive analytics such as deep learning. Currently leveraging Python at CIBC to build GPU-enabled neural networks as well as anomaly detectors.

SOFTWARE AND PROGRAMMING SKILLS

Programming Languages and Tools: Python, TensorFlow, PyTorch, Scikit-learn, Pandas, SciPy, NumPy, Keras, R, SQL, SAS, Stata, C++

Data Visualization Tools: Jupyter Notebook, Matplotlib, Plotly, Shiny, Seaborn, Tableau

Big Data Technologies: PySpark, Apache Spark SQL, Azure Databricks, Hadoop, Hive, Netezza

Version Control & Project Management: Git, GitHub, Jira, GitLab, Microsoft Office

Statistical Methods: Bayesian statistics, Markov Chain Monte-Carlo; hypothesis testing and bootstrapping; regressions, including time series forecasting and survival analysis; missing data analysis; power and sample size calculations

WORK EXPERIENCE

Canadian Imperial Bank of Commerce (CIBC) Oct 2022 - present

Data Scientist (Data Scientist, Advanced Threat Detection) Toronto, ON, Canada

Achieved an industry-leading PR AUC of 99.6% in classifying malicious URLs, in response to increasing security threats, by developing a GPU-enabled, NLP-based convolutional neural network (CNN) using JSON, .txt, and .csv data

Detected anomalies in employee activity by implementing a GPU-accelerated ensemble-based anomaly detector (Auto-IF)

Used SQL to query gigabytes of data from REST APIs; conducted version control using Git

Used Jira to track project milestones in an Agile/Scrum environment to deliver projects on time; created data visualizations with Plotly and Matplotlib to present to stakeholders, including senior staff

Bank of Montreal (BMO) Jun 2021 - Sep 2022

Model Risk Specialist (Model Validation) Toronto, ON, Canada

Improved model recall and PR AUC by 1-5% across multiple projects, by providing feedback on various data science models including logistic regressions, random forests, XGBoost models, CNNs and FaceNet, SVMs, and NLP-based algorithms

Reviewed NLP preprocessing techniques such as TF-IDF, GloVe, and BERT (from Hugging Face Transformers), as well as CNN-based preprocessing techniques from OpenCV such as Gaussian smoothing and Laplacian operators

Validated Python, SAS, and Spark SQL code in Dataiku to ensure model accuracy

Replicated Hadoop, Hive, and Netezza files to improve data integrity

Preteckt Feb 2021 - Apr 2021

Junior Data Scientist Hamilton, ON, Canada

Led and mentored the data science team (including senior data scientists) in frequentist hypothesis testing with R and Python; wrote Python modules for the team; used Git to update GitHub repositories with multiple co-authors

Leveraged SQL for data extraction from TimescaleDB and DigitalOcean cloud environments, enabling data analyses in Jupyter Notebooks that drove key business decisions

Lakehead University Sep 2018 - Apr 2020

Graduate Assistantship (Data Analyst) Thunder Bay, ON, Canada

Used Stata for statistical analyses on mental health data (chi-squared, Fisher’s Exact, and odds ratio calculations), leading to insights that were presented at a Society for Epidemiologic Research (SER) conference in Minneapolis, Minnesota

EXPERIENCE IN STATISTICS AND OTHER RELEVANT QUALIFICATIONS

4+ years of experience in statistical modeling and analyses in Python, R, SAS, and Stata

3 years in R, 2 years in SAS, 1 year in Stata, 3 years in Python

Conducted multivariate regressions: linear, generalized linear (e.g., Poisson), generalized linear mixed; survival analyses

Design of experiments (DOE): conducted one-way, two-way, nested, blocked ANOVAs; fixed, random, mixed ANOVAs

- Ran other statistical tests (e.g. Shapiro-Wilk, Hosmer-Lemeshow, Levene’s, Bartlett’s, likelihood ratio, Tukey’s)

- Hypothesis testing: e.g. t-tests, Fisher’s Exact, Chi-squared, bootstrapping, permutation, sign and rank tests

Bayesian statistics with Markov Chain Monte-Carlo (MCMC): t-tests via generalized linear models (GLMs), correlation tests

Missing data analysis: Expectation-Maximization (EM), Multiple Imputation (MI); bivariate analyses

Plotting: e.g., kernel density estimate plots, scree plots, violin plots, time series plots, boxplots; model diagnostic plots

EDUCATION

Lakehead University - Thunder Bay, ON, Canada

Master’s in Epidemiology - MHSc Epidemiology Sep 2018 - Oct 2020

Honours Bachelor’s in Biology - HBSc Biology Sep 2015 - Jul 2018

Bachelor’s in Chemical Engineering - BEng Chemical Engineering Sep 2014 - Sep 2015

EXTRA-CURRICULAR EDUCATION IN DATA SCIENCE

PrepVector Nov 2023 - present

8-week Product Data Science Bootcamp Remote (United States)

Learned how to conduct various A/B tests, including A/A tests, A/B/n tests, and multi-armed bandit (MAB) testing

Learned how to select business metrics for assessing the performance of product data science models in production

Udemy Apr 2021 - Aug 2021

Deep Learning A-Z: Hands-On Artificial Neural Networks using TensorFlow and PyTorch Remote (United States)

Built a 4-layer LSTM recurrent neural network to predict Google stock prices; mean absolute percentage error of 1.45%

Built a convolutional neural network on 8,000 photos (174 MB) to classify dog or cat photos with 78.4% validation accuracy

Coursera - IBM Aug 2020 - Apr 2021

Certificate in Machine Learning with Python Remote (United States)

Ran clustering (K-means, hierarchical, DBSCAN) and classification (e.g. KNN, SVM) models on simulated data

Coursera - Johns Hopkins University Mar 2017 - Aug 2017

Certificate in Data Science Remote (United States)

Mined >270,000 sentences from Twitter, News, and Blog feed text in R to develop a predictive-text dashboard in Shiny

Ran a random forest on >19,600 rows of human exercise data to predict exercise grades with 99.2% test set accuracy

EXTRA-CURRICULAR EDUCATION IN STATISTICS

Statistical Horizons Dec 2023

3-day Workshop (Applied Bayesian Data Analysis: A Second Course) Remote (United States)

Built various Bayesian models in R, including latent factor models and mixed models

Reviewed the theoretical statistics behind various Bayesian models, including latent factor models and mixed models

Monash University, Australia Nov 2023 - Dec 2023

Forecasting: Principles and Practice (3rd ed) Remote (Canada)

Built various time series models: ARIMA, SARIMA, ETS, and Seasonal and Trend decomposition using Loess (STL) in R

Quantitative Methods Workshop Series May 2022

3-day Workshop (Introduction to Bayesian Analysis and Monte Carlo Simulation) Remote (Canada)

Performed Monte Carlo simulations in R to estimate means and variances in normal distributions

Performed Bayesian generalized mixed modeling, t-tests, and correlation tests in R


