RELOCATABLE Charlottesville, VA 434-***-**** firstname.lastname@example.org
University of Virginia, Charlottesville, VA 08/2016 – 05/2018 M.S. in Statistics GPA: 3.77
Sichuan University, Sichuan, China 09/2012 – 06/2016 B.S. in Mathematics
Python, SQL, R, Spark, MySQL, Tableau, Microsoft Office, Bilingual - Mandarin and English
Machine Learning, Data Analysis, Data Visualization
EXPERIENCE
National Bureau of Statistics, Chengdu, China Data Analyst Internship 07/2014 – 10/2014, 02/2016 – 07/2016
Optimized thousands of DML queries on a MySQL server using execution plans to make better use of indexes, accelerating queries by 30% to 220%.
Optimized index design and redesigned the database schema, improving query performance by at least 40%.
Built and maintained SQL scripts, indexes and queries in MySQL for data analysis and extraction on 60-column survey data from 15,000+ residents.
Chengdu Institute of Standardization, Chengdu, China Research Assistant Internship 02/2014 – 12/2014
Conducted data extraction, cleaning and aggregation on over one million rows of quality index data using SQL queries.
Visualized quality index data in hundreds of graphs using R (ggplot2) and Tableau for the annual Chengdu quality index report.
Taught an intern how to visualize data using R, followed up on progress and provided feedback for improvement.
PROJECTS
Recommendation System Design Project 10/2017 – 01/2018
Built a movie recommendation model locally with both Python and PySpark (MLlib) to predict unobserved user ratings.
Processed over 26 million movie ratings from MovieLens using model-based collaborative filtering with the ALS algorithm.
Optimized hyper-parameters using K-fold cross validation on a small portion of data, with RMSE and AUC as metrics.
Reached a model prediction RMSE of 0.91 and AUC of 0.823 on testing data. Model is ready for recommendation using top K ratings.
Natural Language Processing Project 07/2017 – 08/2017
Clustered the top 100 movies from IMDb and Wikipedia by building an unsupervised model with Python (NLTK, scikit-learn).
Used TF-IDF to transform documents into a weight matrix after tokenization and stemming.
Built a K-means model for clustering and chose the number of clusters by selecting the smallest within-cluster sum of squared distances.
Evaluated results by analyzing cluster keywords and visualizing the clusters in Python with the help of principal component analysis.
Lending Club Loan Analysis Project 05/2017 – 07/2017
Constructed a multilayer perceptron model with Python (pandas, scikit-learn) to predict whether a customer’s loan would default.
Cleaned one million rows of data, using Python to fill in missing values with means, modes or predictions.
Identified top 27 customer information features applicable to both old and new users, enabling program efficiency.
Achieved overall accuracy of 0.99, precision of 0.99 and recall of 0.98 by optimizing model hyper-parameters with K-fold cross validation and evaluating the model with a confusion matrix.
User Churn Prediction Project 11/2016 – 01/2017
Constructed models with Python (pandas, Matplotlib, scikit-learn) to predict telephone subscriber turnover, using models including random forest, K nearest neighbors and logistic regression with hyper-parameters tuned by K-fold grid search cross validation.
Processed 4,700 rows of imbalanced and insufficient user records, containing under 250 churned records, from SGI ML database.
Tried different over-sampling methods to create synthetic data for rebalancing, and evaluated models with confusion matrices.
Random forest, chosen as the final model and using synthetic data created by SMOTE, reached a precision of 0.75 and recall of 0.80 for the rare class (churned), with an overall accuracy of 0.98 on testing data.