YIWEN (ZOE) ZHOU
College Station, Texas *****
Cell: 979-***-**** Email: ************@*****.***
EDUCATION
Texas A&M University College Station, TX
Master in Statistics, GPA: 3.9/4.0 Sept. 2017 – May 2020 (expected)
Nanjing University of Technology Nanjing, China
Bachelor in Materials Science and Engineering, GPA: 3.8/4.0 Sept. 2013 – June 2017
Top-grade scholarships of Intensive Program
TECHNICAL SKILLS
• Language & Tools: R, Python, SQL, SAS, C++, MATLAB, Octave, Linux, Tableau
• Database: MySQL, PostgreSQL
• Framework: Data mining, Big Data (MapReduce, Hadoop, Hive)
• Certificate: SAS Certified Base Programmer for SAS 9
WORK EXPERIENCE
SHELL Houston, TX
Data Scientist Intern Sep. 2019 – Dec. 2019
• Applied machine learning models including XGBoost, LightGBM, Random Forest, and neural networks to predict the dew point pressure of petroleum, and compared the results with the traditional methods used in the oil industry.
• Performed data cleaning and visualization with NumPy, Matplotlib, and seaborn. Engineered features and selected the most important ones using feature grouping, stepwise selection, and Lasso regression.
• Created artificial proxy features based on Elsharkawy’s correlation, improving model prediction accuracy by 14% and reaching a mean absolute relative error (MARE) of 7.299%.
Texas A&M University, Department of Horticultural Sciences College Station, TX
Research Assistant June 2019 – Aug. 2019
• Cleaned and preprocessed pecan gene data and uploaded it to the ‘Structure’ software to calculate the population structure and run genomic prediction analysis in a Unix environment.
• Imported phenotypic and genotypic data into the ‘GAPIT’ package in R to identify the significant genes that influence the genomic structure, using the compressed mixed linear model, the regular mixed linear model, and the multiple-locus mixed linear model.
• Analyzed the data using PC plots, kinship plots, neighbor-joining trees, and Manhattan plots.
SOFWERX Tampa, Florida (Remote)
Data Scientist Intern Jan. 2019 – Apr. 2019
• Created an automated solution that allows a user to mimic a voice using text-to-speech (TTS).
• Implemented deep neural networks with LSTM layers as substitutes for TTS system components, speeding up the system and improving the text-to-speech pipelines.
• Increased the user click-through rate of the product on the website by 1.2%.
PROJECTS
TAMU Data Science Competition Project: LA Metro Bike Share Data Analysis Jan. 2019
• Implemented machine learning models to predict the number of bicycles needed at each station, and evaluated the pricing system to recommend possible pricing changes and network expansion.
• Applied Random Forest, a generalized linear model (GLM) with Poisson regression, SARIMA, and Prophet time series models to forecast bicycle demand and ticket sales for the coming month.
• Evaluated the differences between the four regions in LA by visualizing bike stations on Google Maps with the ‘Bokeh’ library in Python, and extracted the characteristics behind success in each region. Improved the revenue of the Metro Bike Share program by 2% after implementing the network management.
Conversion Rate Optimization Challenge Sep. 2018
• Developed machine learning models to predict conversion rate and made recommendations to the product and marketing teams for improving it.
• Performed data preprocessing and investigated the distribution and statistics of different features.
• Implemented a Random Forest model to predict conversion rate, and used partial dependence plots and decision tree visualizations to identify important features and gain further insight into the data.
Yelp NLP Restaurant Recommendation Systems Dec. 2018
• Built a restaurant recommendation system in Python from historical data from the Yelp website.
• Fitted TfidfVectorizer to obtain a TF-IDF representation of the review text and clustered the reviews with K-Means. Applied Logistic Regression and Random Forest to classify reviews as positive or negative according to user ratings.
• Built the recommendation system with several methods: content-based filtering, collaborative filtering (user-based and item-based), SVD matrix factorization, and UV matrix factorization. SVD performed best, with the lowest RMSE (1.74), and improved the average customer rating score by 0.17.