RUOXIN LI
Mill Creek WA *****·217-***-****·************@*****.***·https://www.linkedin.com/in/ruoxin-li/
Data analyst with three years of experience in ETL processes, statistical modeling, machine learning, and data visualization. Strong statistical knowledge and solid programming skills in Python and SQL.
Education
University of Illinois Champaign IL GPA: 3.83
Master of Science in Geographic Information Science December 2017
Shanxi University of Finance and Economics Taiyuan China GPA: 3.44
Bachelor of Science in Environmental Science July 2016
Certificate: IBM AI Engineering. Certificate issued in February 2020.
Skills
Programming skills: proficient in Python, SQL, and R; familiar with HTML and JavaScript
Tools: SQL Server, Tableau, NumPy, pandas, scikit-learn, Hive, PyTorch, Spark
Experience
Geonamic Systems Inc. Duluth GA (currently working remotely)
Data Analyst September 2018 – Present
Collected data from inline inspections and alignment sheet extraction. Built a logistic regression model in Python to identify whether two data points from different sources are aligned. Reduced processing time by 50% compared with manual work.
Bridged the gap between developers and clients, ensuring product quality and schedule. Documented release notes.
Managed liquid pipeline risk analysis: created an R package to batch-process data filtering and reporting, and delivered a dynamic Excel dashboard showing pipeline profile and release volume. Documented the methodology for internal training.
Smart Steps Digital Technology Company Limited Beijing, China
Data Analyst Internship July 2018 – August 2018
Performed customer churn prediction on labeled data from China Unicom.
Used Hive SQL to access the data and preprocessed it with categorical encoding and numerical standardization. Analyzed key factors with a correlation matrix. Trained supervised machine learning models including Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbors (KNN). Applied regularization with tuned hyperparameters to reduce overfitting.
Evaluated model performance with confusion matrices under cross-validation; deployed the RF model with an accuracy of 85.6%.
Visualized the results in Tableau dashboards and presented them to the manager.
Projects
Amazon Prime Video Rating Prediction March 2020
Collected data from the Amazon Prime Video open dataset and filled missing values with column means. Encoded the categorical data and standardized the numerical data.
Built linear regression models with L1 and L2 regularization and a random forest (RF) model. Used grid search to tune the hyperparameters. Selected RF, which had the lowest RMSE and the highest R-squared (0.51).
Analyzed feature importance using the random forest. Found that the most influential features were the positions of the video on the page.
Movie Recommendation Engine Development in Apache Spark February 2020
Built a data pipeline to analyze a movie rating dataset with collaborative filtering and conducted OLAP with Spark SQL.
Implemented the Alternating Least Squares (ALS) model to provide personalized movie recommendations and developed user-based approaches to handle the cold-start problem.
Tuned model hyperparameters (regularization parameter, number of iterations, and number of features) with Spark ML cross-validation, reaching a lowest RMSE of 0.75.
NLP Topic Modeling January 2020
Conducted latent sentiment detection on 100k customer reviews using clustering techniques.
Preprocessed the customer review text with tokenization, stemming, and stop-word removal; extracted features using Term Frequency-Inverse Document Frequency (TF-IDF) and term counts.
Trained unsupervised machine learning models in Python: K-means clustering and Latent Dirichlet Allocation (LDA). Identified latent topics and keywords for each review group.