University of San Francisco M.S. in Data Science Jul. 2018 – Expected Jun. 2019
Coursework: Machine Learning, Data Visualization, Distributed Computing, Data Acquisition, Time Series Analysis, Relational Databases, Data Structures & Algorithms, Computational Statistics, A/B Testing, Experimental Design, Product Analysis
University of Florida Ph.D. Candidate in Environmental Horticulture Aug. 2017 – May 2018
Coursework: Computer Programming with R, Applied Statistics
University of California, Davis M.S. in Agronomy & Horticulture Sep. 2015 – Jun. 2017
Coursework: Applied Multivariate Modeling, Advanced Plant Breeding
Northeast Forestry University B.S. in Forestry Resources Sep. 2011 – Jun. 2015
Exchange Programs: National Chung Hsing University (Sep. 2013 – Jan. 2014), Harbin Institute of Technology (2011 – 2012)
Data Scientist Intern Fair.com Oct. 2018 – Current
Project: Vehicle Recommender System
Constructed a content-based recommender system that suggests similar vehicles based on user search history, to improve customer conversion rate and user experience.
Designed and implemented data extraction with Snowflake ETL and SQL, and conducted feature engineering with Python.
Built the vehicle recommendation engine by creating high-dimensional vehicle embeddings with a CBOW model and visualized vehicle similarity using t-SNE plots, outperforming the production models by 50% in variance reduction for some vehicle categories.
Project: Vehicle Sales Price Prediction
Provided a dynamic pricing and trading strategy to support the company's vehicle purchasing decisions by identifying key price-driving factors and interpreting machine learning models; presented business insights to the data science team weekly.
Designed predictive models to forecast vehicle sales prices by building and ensembling Random Forest, XGBoost, and time series models.
Research Assistant University of California, Davis Sep 2015 – Jun 2017
Designed and developed protocols for the priming and post-priming processing of tomato seed lots to improve seed performance while retaining seed longevity.
Predicted and quantified seed longevity by constructing time-aging models in R using dplyr, tidyverse, ggplot2, etc.
California Air Quality Prediction and Data ETL Pipeline
Designed and applied an automated data pipeline bridging AWS S3, MongoDB on AWS EC2, and Spark on AWS EMR.
Developed Logistic Regression and Random Forest models with the PySpark MLlib APIs to predict California air quality, achieving 81% predictive accuracy and running 1000 times faster than the scikit-learn library.
Publication: Co-first author, “A Scalable and Reliable Model for Real-time Air Quality Prediction,” The 5th IEEE Smart World Congress (under review).
Movie Ranking System
Built a collaborative filtering model that uses matrix factorization to predict potential movie ratings.
Encoded rating data with contiguous IDs for users and movies and created embeddings for both. Optimized the collaborative filtering model using gradient descent with momentum.
User In-App Purchase Prediction
Preprocessed and stored large-scale user behavior data (40 GB) with AWS EC2 and S3, and conducted feature engineering using Python.
Predicted user purchase probability over the next 7 and 14 days using machine learning techniques (Logistic Regression, Random Forest, XGBoost, LightGBM); the ensemble model achieved a 0.98 AUC score on test data.
Programming and Tools: Python (scikit-learn, Pandas, NumPy, SciPy, XGBoost, Matplotlib, Gensim), PyTorch, R, AWS (EC2, S3, EMR), Bash, Git, Excel, Plotly, Snowflake ETL, Redshift, Tableau, Bokeh, Flask, HTML
Database and Distributed Computing: SQL (PostgreSQL), NoSQL (MongoDB), Spark
Analysis Techniques: Clustering (K-Means, Spectral), PCA, A/B Testing, Gradient Boosting, SVM, Hypothesis Testing, Anomaly Detection, Regression, Collaborative Filtering, Neural Networks (CNN, RNN, GAN), NLP (Word2Vec, CBOW)
Languages: English (Bilingual Working Proficiency), Mandarin Chinese (Native)