
Python, R, Git, HTML, SQL, NoSQL (MongoDB), Spark, machine learning

San Francisco, CA
March 26, 2019



Zhi Li

San Francisco, CA, ***** 530-***-**** Portfolio


University of San Francisco M.S. in Data Science Jul. 2018 – Expected Jun. 2019

Courses: Machine Learning, Data Visualization, Distributed Computing, Data Acquisition, Time Series Analysis, Relational Database, Data Structure & Algorithms, Computational Statistics, A/B Testing, Experimental Design, Product Analysis

University of Florida Ph.D. Candidate in Environmental Horticulture Aug. 2017 – May 2018

Coursework: Computer Programming with R, Applied Statistics

University of California, Davis M.S. in Agronomy & Horticulture Sep. 2015 – Jun. 2017

Coursework: Applied Multivariate Modeling, Advanced Plant Breeding

Northeast Forestry University B.S. in Forestry Resources Sep. 2011 – Jun. 2015

Exchange Programs: National Chung Hsing University (Sep. 2013 – Jan. 2014), Harbin Institute of Technology (2011 – 2012)


Data Scientist Intern Oct. 2018 – Current

Project: Vehicle Recommender System

Constructed a content-based recommender system that recommends similar vehicles based on users' search history, to improve customer conversion rate and user experience.

Designed and implemented data extraction with Snowflake ETL and SQL, and conducted feature engineering with Python.

Built a vehicle recommendation engine by creating high-dimensional vehicle embeddings with a CBOW model and visualized vehicle similarity using t-SNE plots; the engine outperformed the productionized models by 50% in variance reduction for some vehicle categories.

Project: Vehicle Sales Price Prediction

Provided a dynamic pricing and trading strategy to support the company's vehicle purchasing decisions by identifying key price-driving factors and interpreting machine learning models; presented business insights to the data science team weekly.

Designed predictive models to forecast vehicle sales prices by applying and ensembling machine learning algorithms such as Random Forest and XGBoost together with time series methods.

Research Assistant University of California, Davis Sep 2015 – Jun 2017

Designed and developed protocols for the priming and post-priming processes of tomato seed lots to improve seed performance while retaining seed longevity.

Predicted and quantified seed longevity by constructing time-aging models in R using dplyr, tidyverse, ggplot2, etc.


California Air Quality Prediction and Data ETL Pipeline

Designed and applied an automated data pipeline to bridge AWS S3, MongoDB on AWS EC2, and Spark on AWS EMR.

Developed Logistic Regression and Random Forest models with PySpark MLlib APIs to predict California air quality, achieving 81% predictive accuracy and reducing computation time by a factor of 1,000 compared with the scikit-learn library.

Publication: Co-First Author. "A Scalable and Reliable Model for Real-time Air Quality Prediction." (Under Review) The 5th IEEE Smart World Congress.

Movie Ranking System

Built a collaborative filtering model to predict potential movie ratings with matrix factorization.

Encoded rating data to have contiguous IDs for users and movies and created embeddings for both users and movies. Optimized the collaborative filtering model by gradient descent with momentum.

User In-App Purchase Prediction

Preprocessed and stored large-scale user behavior data (40 GB) with AWS EC2 and S3, and conducted feature engineering using Python.

Predicted user purchase probability in the next 7 and 14 days using machine learning techniques (Logistic Regression, Random Forest, XGBoost, LightGBM); the ensemble model achieved a 0.98 AUC score on test data.


Programming and Tools: Python (Sklearn, Pandas, NumPy, SciPy, XGBoost, Matplotlib, Gensim), PyTorch, R, AWS (EC2, S3, EMR), Bash, Git, Excel, Plotly, Snowflake ETL, Redshift, Tableau, Bokeh, Flask, HTML

Database and Distributed Computing: SQL (PostgreSQL), NoSQL (MongoDB), Spark

Analysis Techniques: Clustering (K-Means, spectral), PCA, A/B Testing, Gradient Boosting, SVM, Hypothesis Testing, Anomaly Detection, Regression, Collaborative Filtering, Neural Networks (CNN, RNN, GAN), NLP (Word2Vec, CBOW)

Languages: English (Bilingual Working Proficiency), Mandarin Chinese (Native)
