Haowei Ni
*** **** ***** **, *** ***. New York, NY 10025
**********@*****.*** • 619-***-****
EDUCATION
Columbia University, Mailman School of Public Health New York, NY May 2020
Master of Science in Biostatistics, Theory and Methods Track, GPA: 3.5/4.0
Courses: Machine Learning • Data Mining • Database & SQL Programming • Statistics & Probability • Advanced Algebra • Data Science • Biostatistics Methods and General Linear Models • Longitudinal and Survival Data
University of California San Diego San Diego, CA June 2018
Bachelor of Science in Biochemistry and Cell Biology
WORK EXPERIENCE
Weill Cornell Medicine New York, NY Jan 2020
Data Scientist Intern
Conducted web scraping in Python to collect additional features and fill in missing values.
Predicted disease trends using several regression methods, including multinomial logistic regression, penalized models (Lasso/Ridge/Elastic Net) and polynomial regression with feature interactions.
Implemented parametric and non-parametric hypothesis tests (two-sample t-test, chi-square test, Fisher's exact test, Kruskal-Wallis test, etc.) to investigate risk factors related to cardiovascular disease.
Applied unsupervised learning methods, including K-means clustering and PCA, to group similar coronary lesions.
Global AI Corporation New York, NY Jun 2019 to Sep 2019
Data Scientist Intern
Applied natural language processing techniques, such as the bag-of-words model, document clustering and topic modeling, to perform feature engineering on large-scale unstructured text data using Python.
Translated files from more than 10 languages into English and made heatmaps to evaluate the degree of emphasis on each topic provided in the UNGC Taxonomy for over 10,000 companies.
Analyzed Argentinian news data by building autoregressive models for time series analysis and forecasting of news tone, and visualized the trends in an interactive Tableau dashboard.
Classified arbitrary unlabeled messages by training logistic regression, XGBoost, SVM and Naive Bayes models.
Developed automation packages for cleaning 5 million rows of text data and stored the results in pickle files.
PROJECTS
Columbia University New York, NY Jan 2019 to May 2019
Beijing Airbnb Price Range Analysis
Built machine learning models (Neural Network/Decision Tree and Random Forest/Adaptive Boosting/Stochastic Gradient Boosting/XGBoost/LightGBM) to predict price range in Python.
Preprocessed data with missing value imputation, collinearity removal and bootstrap resampling.
Performed stepwise regression and PCA to reduce model complexity and training time.
Evaluated models using adjusted R², MSE, AIC and cross-validation (K-fold, LOOCV), and improved accuracy by tuning hyperparameters.
Extracted keywords using TF-IDF and n-grams and conducted sentiment and collocation analysis on 200,000+ reviews in Python (gensim) to provide insights for improving customer satisfaction.
Spotify Saved Songs Profile Exploration Sep 2018 to Dec 2018
Created datasets from a user’s saved songs library and extracted audio features using the Spotify Web API.
Analyzed the datasets by applying t-SNE to reduce feature dimensions, used K-means clustering to group songs with similar characteristics, and profiled the clusters with radar charts.
SKILLS
PROGRAMMING AND SOFTWARE: Python (Pandas, NumPy, SciPy, Scikit-Learn, Matplotlib, BeautifulSoup), R (Tidyverse, Shiny, Leaflet), SQL, Google BigQuery, Tableau, Excel
DATA MINING: Linear Regression (Penalized and Unpenalized), Clustering and Classification (K-means, Hierarchical, PCA, Random Forest), Logistic Regression, LDA, QDA, GAM, Tree-Based Methods, XGBoost, Time Series, Neural Network, SVM, A/B Testing
CERTIFICATE: SAS Certified Base Programmer for SAS 9