New York University, New York City, NY May 2021
• Master of Science in Data Science, GPA: 3.89/4.0
• Coursework: Introduction to Data Science, Probability and Statistics for Data Science, Probabilistic Time Series Analysis, Machine Learning, Big Data, Natural Language Understanding
Rensselaer Polytechnic Institute, Troy, NY August, 2015 - May, 2019
• Bachelor of Science in Mathematics and Computer Science, Minor in Economics, In-Major GPA: 3.9/4.0
• Related Coursework: Data Science, Data Structures, Algorithms, Database Systems, Intro to AI, Data Mathematics, Econometrics, Operating Systems, Principles of Software, Linear Algebra
• Awards: Dean’s Honor List, First Prize in the First Annual Cognitive and Immersive Data Insights Application Challenge
Skills
Languages: Proficient in Python, MySQL, R and C++. Familiar with C, MATLAB, Stata, Haskell, Scala and LaTeX.
Technologies: Pandas, NumPy, Matplotlib, Linux/Unix, Jupyter Notebook.
Research Experiences
Video Anomaly Detection via Gaussian Process (GP), Directed by Professor Cristina Savin Fall, 2019
• Divided each video frame into smaller cells and computed a Histogram of Optical Flow (HoF) for each cell using OpenCV.
• Utilized an online clustering algorithm to construct basic vectors using HoF in training data.
• Fed all basic vectors into a GP with zero mean and an RBF kernel, with the target variable set to one.
• Computed the predictive mean (GPM) and the inverse of the variance (GPV) on the test data as two anomaly scores.
• On the UCSD anomaly detection dataset, GPM and GPV achieved AUCs of 0.857 and 0.835, respectively.
NBA Outcome Prediction, Directed by Adjunct Professor Brian Dalessandro Fall, 2019
• Scraped player-level and team-level statistics from 2008 to 2019 from 3 different NBA websites.
• Merged all datasets from the different sources and engineered features using Pandas and NumPy.
• Selected features in two ways: (1) chose important features using Random Forest feature importance; (2) dropped the most correlated features, then chose important features using mutual information with the target variable.
• Trained Random Forest, K-Nearest Neighbors, Logistic Regression, Support Vector Machine (SVM) and Neural Network models using scikit-learn and tuned hyperparameters using Blocking Time Series Split.
• Logistic Regression and SVM achieved the best average AUC of 0.708 using features selected by Random Forest.
A Web-based App Using Shiny in R, Directed by John S. Erickson, RPI Summer, 2018
• Used a New York State dataset on citizens’ health status to analyze health across all counties and found correlations between health and other statistics (lifestyle, medicare, socioeconomic status, etc.).
• Reduced dimensionality using Robust PCA, which kept the variance, and analyzed the transformed data.
• Wrote scripts to clean and visualize data in R, with packages such as dplyr, tidyr and ggplot2.
• Developed a web-based app using Shiny that takes a query and outputs visualizations.
A Method That Predicts ASD Based on Biomarkers, Directed by Professor Kristin P. Bennett, RPI Spring, 2018
• Dropped correlated features and then normalized remaining features.
• Constructed a model based on Fisher LDA to predict Autism Spectrum Disorder (ASD) in children.
• Achieved 85 percent accuracy on 41 test data points with a 50 percent base rate.