DEREK LIU
Data Scientist
EDUCATION
University of New Haven / Galvanize, San Francisco
Data Science, Master
University of California, Irvine
Economics/Mathematics, BA
EXPERIENCE
Data Science Intern, Scientific Revenue, San Francisco, CA Aug 2017 – Dec 2017
Led a project using neural networks to build a Customer Lifetime Value prediction model on terabytes of real-world gaming data
Optimized several models, including customer segmentation and churn classification, using grid search and cross-validation
Provided detailed analysis reports with clear visualizations of customer behavior based on monetary value and conversion rate on an hourly basis
Operation Team Administrator, Panda Restaurant Group, Rosemead, CA Aug 2016 – Nov 2016
Assisted with store inventory fulfillment and maintained adequate inventory levels by placing orders for the stores
Updated inventory data and managed the e-commerce software to keep inventory status accurate
Provided customer service for shipping and inventory issues
Real Estate Analyst, Sunny Valley LLC, Alhambra, CA July 2015 – June 2016
Analyzed market trends of properties in assigned areas by compiling data into financial models for investment evaluation by senior management
Prepared forecast and variance analysis on a weekly basis
Prepared leases and resolved leasing issues with landlords and property managers
PROJECTS
Best movie gross predictor Feb 2017 - https://github.com/derekliu7/Stats-Final-Project
Identified the best predictor of movie gross by fitting various statistical models to the IMDB Movie Dataset.
-Performed exploratory data analysis (EDA), data cleaning, and feature engineering
-Used Seaborn and Matplotlib to show the feature correlation and the distribution of the sample data
-Identified the most important features by comparing the significance level of each feature using an OLS model
Technologies: Pandas, Numpy, Matplotlib, Statsmodels, Scipy, Seaborn
Meetup RSVP May 2017 - https://github.com/derekliu7/DE-Final-Project
Built a data pipeline to consume RSVP data from the Meetup API, then applied a machine learning algorithm to the data.
-Built a data pipeline using Websocket to stream from the Meetup API and store the data in AWS S3
-Performed real-time analysis using Kafka and Spark Streaming
-Classified events by their descriptions using the Multinomial Naïve Bayes algorithm
Technologies: Pandas, Numpy, Matplotlib, Sklearn, Pyspark, NLTK, Websocket, AWS, Kafka
Photo Interestingness July 2017 - https://github.com/derekliu7/DL-Final
Trained a CNN model on over 5,000 movie screenshots to predict how interesting the displayed content is.
-Balanced the dataset by using SMOTE and re-weighting the classes
-Used three different pre-trained Convolutional Neural Network (CNN) models (VGG19, ResNet50, InceptionV3) to compare the results
Technologies: Pandas, Numpy, Matplotlib, Sklearn, Keras, SMOTE
Amazon reviews classification Sep 2017 - https://github.com/derekliu7/NLP-Final
Used Natural Language Processing to determine whether a product has a positive or negative rating based on its user reviews.
-Applied regex to clean the text data
-Used TF-IDF to vectorize (unigrams and bigrams) over 1.6 million user reviews
-Built a Logistic Regression model and an SGD model, both with over 80% accuracy
Technologies: Pandas, Numpy, Matplotlib, Sklearn, NLTK, Seaborn
Customer Lifetime Value Prediction Oct 2017
Built a Customer Lifetime Value prediction model with a Multilayer Perceptron.
-Trained an Autoencoder to perform unsupervised feature engineering
-Applied different sampling methods (SMOTE, Oversampling) on the training data for model comparison
-Built a joint Multilayer Perceptron model by training a classification model and a regression model, which outperformed the Random Forest and XGBoost models
Technologies: Pandas, Numpy, Matplotlib, Sklearn, Redshift SQL, Keras, SMOTE, XGBoost, Scipy
Gym Crowdedness prediction Dec 2017 - https://github.com/derekliu7/TimeSeries-Analysis
Used time series analysis to predict the total number of people in a local gym in the next hour.
-Performed feature engineering, plotted the hourly and weekly headcount distributions, and measured the correlation between features
-Built an ARIMA model using the historical data, which outperformed the Random Forest model with an RMSE of 8.8
Technologies: Pandas, Numpy, Matplotlib, Sklearn, Scipy, Seaborn, Statsmodels