Qiankun Huang
*****.*****@*****.***.***
Looking for a Data Scientist position
Northeastern University Boston, MA
Master's degree in Applied Math Dec 2017
Sichuan University Chengdu, China
Bachelor's degree in Math and Economics June 2015
SKILLS
• Professional skills:
Python, R, SQL, Hadoop, Spark, Matlab, Java
• Machine learning skills:
Linear Regression, Logistic Regression, KNN, SVM, Random Forest, GBDT, K-means, PCA, TF-IDF, LDA, MCMC, ALS, CNN, ARIMA
Project Experience
Telephone churn prediction model
• Predicted whether the service would be canceled based on customer behavior data.
• Preprocessed data through cleaning, categorical feature transformation, standardization and feature correlation analysis with the Python sklearn, pandas and numpy libraries.
• Implemented K-fold cross-validation to compare Logistic Regression, Random Forest and KNN models. The best model was Random Forest, with 0.94 accuracy, 0.99 precision and 0.95 recall.
Clustering movies with natural language processing
• Preprocessed movie scripts by tokenizing and stemming keywords, then calculated the TF-IDF matrix.
• Applied K-means and Latent Dirichlet Allocation (LDA) for document clustering with the Python sklearn and nltk libraries.
• Developed visualization tools to evaluate the results. LDA worked more efficiently than K-means according to the instability measure.
Movie recommendation system (Big Data Project)
• Transformed millions of customer reviews into Spark SQL tables and ran OLAP queries to better understand the data.
• Used the ALS collaborative filtering method in the Spark ML library to fill the sparse rating matrix and found the best hyperparameters by cross-validation. The best model's MSE was 0.24.
• Combined the ALS results with item-based and user-based methods to recommend the top 3 rated movies for each user.
Company bankruptcy prediction
• Predicted whether a company would go bankrupt and identified the most important of 64 financial features.
• Standardized millions of data points and fixed thousands of missing values and outliers.
• Used cross-validation to split the data and evaluated SVM, Random Forest, Logistic Regression and KNN by recall and precision. The best model was Random Forest, with 0.99 recall and 0.99 precision.
• Calculated the parameters for the Logistic Regression model and the out-of-bag error for Random Forest. The most important feature was gross profit.
Carbonate formation induced by CO2 (National Science Project)
• Used a DRAM MCMC model to find confidence intervals for over 300 parameters of unsolved differential equations, with over a million data points.
• Compared DRAM, Gibbs sampling and the MH method. After 10,000 iterations, the DRAM MCMC test result was over 85% accurate and far more effective than the others.
• Tried to improve the DRAM method by changing the loss function, the assumed hyper-distribution and the threshold, which improved accuracy to over 93%.
Internship Experience
Woods Hole Oceanographic Institution Falmouth, MA
Research Assistant
• Preprocessed millions of experimental data points and fixed missing values and outliers.
• Designed hypothesis tests to check data reliability and developed tools for data visualization.
• Assisted with Carbonate Formation Induced by CO2 (National Science Project).