Data Python

Location:

New York City, NY, 10025

Salary:

90000

Posted:

February 12, 2020


Resume:

Haowei Ni

*** **** ***** **, *** ***. New York, NY 10025

**********@*****.*** • 619-***-****

EDUCATION

Columbia University, Mailman School of Public Health New York, NY May 2020

Master of Science in Biostatistics, Theory and Methods Track, GPA: 3.5/4.0

Courses: Machine Learning • Data Mining • Database & SQL Programming • Statistics & Probability • Advanced Algebra • Data Science • Biostatistics Methods and General Linear Models • Longitudinal and Survival Data

University of California San Diego San Diego, CA June 2018

Bachelor of Science in Biochemistry and Cell Biology

WORK EXPERIENCE

Weill Cornell Medicine New York, NY Jan 2020

Data Scientist Intern

Performed web scraping in Python to collect additional features and fill in missing values.

Predicted disease trends using regression methods including multinomial logistic regression, penalized models (Lasso/Ridge/Elastic Net), and polynomial regression with feature interactions.

Implemented parametric and non-parametric hypothesis tests (two-sample t-test, chi-square test, Fisher's exact test, Kruskal-Wallis test, etc.) to investigate risk factors related to cardiovascular disease.

Applied unsupervised learning methods, including K-means clustering and PCA, to group similar coronary lesions.

Global AI Corporation New York, NY Jun 2019 to Sep 2019

Data Scientist Intern

Applied Natural Language Processing techniques, such as Bag of Words model, document clustering and topic modeling, to perform feature engineering on large-scale unstructured text data using Python.

Translated files from more than 10 languages into English and made heatmaps to evaluate the degree of emphasis on each topic provided in the UNGC Taxonomy for over 10,000 companies.

Analyzed Argentinian news data by building autoregressive models for time series analysis and forecasting of news tone, and visualized the trends in an interactive Tableau dashboard.

Classified arbitrary unlabeled messages by training logistic regression, XGBoost, SVM and Naive Bayes models.

Developed automation packages to clean 5 million rows of text data and store the results in pickle files.

PROJECTS

Columbia University New York, NY Jan 2019 to May 2019

Beijing Airbnb Price Range Analysis

Built machine learning models (Neural Network/Decision Tree and Random Forest/Adaptive Boosting/Stochastic Gradient Boosting/XGBoost/LightGBM) to predict price range in Python.

Preprocessed data with missing value extrapolation, collinearity removal, and bootstrap resampling.

Performed stepwise regression and PCA to reduce model complexity and training time.

Evaluated models using adjusted R², MSE, AIC, and cross-validation (K-fold, LOOCV), and improved accuracy through hyperparameter tuning.

Extracted keywords using TF-IDF and n-grams, and conducted sentiment and collocation analysis on 200,000+ reviews in Python (gensim) to provide insights for improving customer satisfaction.

Spotify Saved Songs Profile Exploration Sep 2018 to Dec 2018

Created datasets from a user’s saved songs library and extracted audio features using the Spotify Web API.

Reduced feature dimensions with t-SNE, grouped songs with similar characteristics using K-means clustering, and profiled the resulting clusters with radar charts.

SKILLS

PROGRAMMING AND SOFTWARE: Python (pandas, NumPy, SciPy, scikit-learn, Matplotlib, BeautifulSoup), R (Tidyverse, Shiny, Leaflet), SQL, Google BigQuery, Tableau, Excel

Data Mining: Linear Regression (Penalized and Unpenalized), Clustering and Classification (K-means, Hierarchical, PCA, Random Forest), Logistic Regression, LDA, QDA, GAM, Tree-based Methods, XGBoost, Time Series, Neural Networks, SVM, A/B Testing

Certificate: SAS Certified Base Programmer for SAS 9
