NYASHA MAKUTO
+1-437-***-**** Toronto, ON, Canada
Email: *********@*****.*** LinkedIn: linkedin.com/in/nyasha-makuto/ Portfolio (GitHub): github.com/Namakuto

SUMMARY
Data Scientist experienced in inferential statistics, seeking to apply machine learning methods in a role that specializes in statistical methods such as hypothesis testing, as well as predictive analytics such as deep learning. Currently leveraging Python at CIBC to build GPU-enabled neural networks as well as anomaly detectors.

SOFTWARE AND PROGRAMMING SKILLS
Programming Languages and Tools: Python, TensorFlow, PyTorch, Scikit-learn, Pandas, SciPy, NumPy, Keras, R, SQL, SAS, Stata, C++
Data Visualization Tools: Jupyter Notebook, Matplotlib, Plotly, Shiny, Seaborn, Tableau
Big Data Technologies: PySpark, Apache Spark SQL, Azure Databricks, Hadoop, Hive, Netezza
Version Control & Project Management: Git, GitHub, Jira, GitLab, Microsoft Office
Statistical Methods: Bayesian statistics, Markov Chain Monte-Carlo; hypothesis testing and bootstrapping; regressions, including time series forecasting and survival analysis; missing data analysis; power and sample size calculations

WORK EXPERIENCE
Canadian Imperial Bank of Commerce (CIBC) Oct 2022 - present
Data Scientist (Advanced Threat Detection) Toronto, ON, Canada
Achieved an industry-leading PR AUC of 99.6% by developing a GPU-enabled, NLP-based convolutional neural network (CNN) using JSON, .txt, and .csv data to classify malicious URLs in response to increasing security threats
Detected anomalies in employee activity by implementing a GPU-accelerated ensemble-based anomaly detector (Auto-IF)
Used SQL to query gigabytes of data from REST APIs and managed version control with Git
Used Jira to track project milestones in an Agile/Scrum environment to deliver projects on time; created data visualizations with Plotly and Matplotlib to present to stakeholders, including senior staff
Bank of Montreal (BMO) Jun 2021 - Sep 2022
Model Risk Specialist (Model Validation) Toronto, ON, Canada
Improved model recall and PR AUC by 1-5% across multiple projects by providing feedback on various data science models, including logistic regressions, random forests, XGBoost models, CNNs and FaceNet, SVMs, and NLP-based algorithms
Reviewed NLP preprocessing techniques such as TF-IDF, GloVe, and BERT (via the Hugging Face Transformers library), as well as OpenCV image preprocessing techniques for CNNs, such as Gaussian smoothing and Laplacian operators
Validated Python, SAS, and Spark SQL code in Dataiku to ensure model accuracy
Replicated Hadoop, Hive, and Netezza files to improve data integrity
Preteckt Feb 2021 - Apr 2021
Junior Data Scientist Hamilton, ON, Canada
Led and mentored the data science team (including senior data scientists) in frequentist hypothesis testing with R and Python; wrote Python modules for the team; used Git to update GitHub repositories with multiple co-authors
Leveraged SQL for data extraction from TimescaleDB and DigitalOcean cloud environments, enabling data analyses in Jupyter Notebooks that drove key business decisions
Lakehead University Sep 2018 - Apr 2020
Graduate Assistantship (Data Analyst) Thunder Bay, ON, Canada
Used Stata for statistical analyses on mental health data (chi-squared, Fisher’s Exact, and odds ratio calculations), leading to insights that were presented at a Society for Epidemiological Research (SER) conference in Minneapolis, Minnesota

EXPERIENCE IN STATISTICS AND OTHER RELEVANT QUALIFICATIONS
4+ years of experience in statistical modeling and analyses in Python, R, SAS, and Stata
3 years in R, 2 years in SAS, 1 year in Stata, 3 years in Python
Conducted multivariate regressions: linear, generalized linear (e.g., Poisson), generalized linear mixed; survival analyses
Design of experiments (DOE): conducted one-way, two-way, nested, blocked ANOVAs; fixed, random, mixed ANOVAs
- Ran other statistical tests (e.g. Shapiro-Wilk, Hosmer-Lemeshow, Levene’s, Bartlett’s, likelihood ratio, Tukey’s)
- Hypothesis testing: e.g. t-tests, Fisher’s Exact, Chi-squared, bootstrapping, permutation, sign and rank tests
Bayesian statistics with Markov Chain Monte-Carlo (MCMC): t-tests via generalized linear models (GLMs), correlation tests
Missing data analysis: Expectation-Maximization (EM), Multiple Imputation (MI); bivariate analyses
Plotting: e.g., kernel density estimate plots, scree plots, violin plots, time series plots, boxplots; model diagnostic plots

EDUCATION
Lakehead University - Thunder Bay, ON, Canada
Master’s in Epidemiology - MHSc Epidemiology Sep 2018 - Oct 2020
Honours Bachelor’s in Biology - HBSc Biology Sep 2015 - Jul 2018
Bachelor’s in Chemical Engineering - BEng Chemical Engineering Sep 2014 - Sep 2015

EXTRA-CURRICULAR EDUCATION IN DATA SCIENCE
PrepVector Nov 2023 - present
8-week Product Data Science Bootcamp Remote (United States)
Learned how to conduct various A/B tests, including A/A tests, A/B/n tests, and multi-armed bandit (MAB) testing
Learned how to select business metrics for assessing the performance of product data science models in production
Udemy Apr 2021 - Aug 2021
Deep Learning A-Z: Hands-On Artificial Neural Networks using TensorFlow and PyTorch Remote (United States)
Built a 4-layer LSTM recurrent neural network to predict Google stock prices; mean absolute percentage error of 1.45%
Built a convolutional neural network on 8,000 photos (174 MB) to classify dog or cat photos with 78.4% validation accuracy
Coursera - IBM Aug 2020 - Apr 2021
Certificate in Machine Learning with Python Remote (United States)
Ran clustering (K-means, hierarchical, DBSCAN) and classification (e.g. KNN, SVM) models on simulated data
Coursera - Johns Hopkins University Mar 2017 - Aug 2017
Certificate in Data Science Remote (United States)
Mined >270,000 sentences from Twitter, News, and Blog feed text in R to develop a predictive-text dashboard in Shiny
Ran a random forest on >19,600 rows of human exercise data to predict exercise grades with 99.2% test set accuracy

EXTRA-CURRICULAR EDUCATION IN STATISTICS
Statistical Horizons Dec 2023
3-day Workshop (Applied Bayesian Data Analysis: A Second Course) Remote (United States)
Built various Bayesian models in R, including latent factor models and mixed models
Reviewed the theoretical statistics behind various Bayesian models, including latent factor models and mixed models
Monash University, Australia Nov 2023 - Dec 2023
Forecasting: Principles and Practice (3rd ed) Remote (Canada)
Built various time series models: ARIMA, SARIMA, ETS, and Seasonal and Trend decomposition using Loess (STL) in R
Quantitative Methods Workshop Series May 2022
3-day Workshop (Introduction to Bayesian Analysis and Monte Carlo Simulation) Remote (Canada)
Performed Monte Carlo simulations in R to estimate means and variances in normal distributions
Performed Bayesian generalized mixed modeling, t-tests, and correlation tests in R