Hao Yin
774-***-**** adgq36@r.postjobfree.com *** Institute Road, Worcester, MA 01609
EDUCATION
WORCESTER POLYTECHNIC INSTITUTE Worcester, MA
Master of Science in Data Science (GPA 3.81/4.0) May 2020
JINAN UNIVERSITY Guangdong, China
Bachelor of Science in Economics (Financial Engineering direction) (GPA 3.7/4.0) Jun 2018
PROFESSIONAL EXPERIENCES
CMIND INC., Data Analyst Jun 2020 - Present
US Secondary Market Fraud Detection And Portfolio Construction
● Programmed a web crawler to extract over 10GB of daily financial data on all US public companies from Yahoo Finance
● Performed feature engineering by creating new features based on finance domain knowledge and results from NLP
● Imputed missing values with the KNN algorithm and combined downsampling with the synthetic minority oversampling technique (SMOTE) to handle the highly imbalanced dataset
● Conducted fraud detection in the secondary market using Naive Bayes, Logistic Regression, SVM, Decision Tree, Random Forest, and XGBoost models
● Optimized hyperparameters of the Random Forest using random search, improving accuracy by 1.5 and AUC by 0.9 (absolute values from 10-fold cross-validation) over the best result without hyperparameter tuning
● Constructed portfolios based on selected fraud signals generated by ML models and measured signal performance. On average, the low-risk portfolio outperformed the S&P 500 Index by 3% per year
● Analyzed return attribution by sector, market cap, etc. with Plotly and provided reports to clients and managers
KRONOS INC., Data Scientist Co-op Jan 2020 - May 2020
Customer Behavior Pattern Recognition And Anomalous Activities Detection
● Performed exploratory data analysis on ~3GB workforce management data including numerical, categorical, and time series data using Matplotlib and Seaborn
● Processed the raw dataset with Python, including feature extraction, outlier detection, and variable standardization
● Constructed a dynamic time warping (DTW) model to measure time series similarity and produced a distance matrix on 120 features
● Designed a multivariate time series clustering model, DTW with hierarchical clustering, to identify customers' usage patterns and measured clustering performance by silhouette coefficient
● Explored the characteristics of customers in each cluster using the DTW barycenter averaging method
● Developed a framework to consolidate a multi-relational database of product attributes using advanced SQL queries
● Trained anomaly detection models using both unsupervised machine learning (isolation forest) and supervised machine learning (linear regression, lasso, ridge, random forest, etc.) to build an alert system for customers' anomalous activities
● Built Tableau dashboards describing customers' usage activities and reported analysis results weekly
AUDIOCN INC., Data Scientist May 2019 - Sep 2019
Personalized Music Recommendation System On Imbalanced Dataset
● Extracted ~10GB of data from the Hadoop file system and wrote an ELT data pipeline in SQL and PySpark scripts
● Constructed user profiles by generating over 50 usage-pattern features for over 1 million customers
● Designed Collaborative Filtering, fine-tuned Random Forest and Gradient Boosting models using Spark MLlib, and used grid search to find hyperparameters that improved the models and reduced overfitting
● Analyzed the confusion matrix and achieved 87.5% accuracy with the fine-tuned Random Forest model
● Collaborated with the software development team to deploy models in a new app version and to design the A/B testing framework and its key metrics. Results showed the conversion rate increased by 20 percent
● Visualized research results using Tableau, held weekly meetings, and presented results using PowerPoint
SELECTED PROJECTS
Deep Learning (PyTorch): Judging the Aesthetic Value of Pictures, WPI Aug 2019 - Dec 2019
● Used the AVA dataset to create training sets of 51,106 images labeled as good-looking, normal-looking, and bad-looking.
● Designed 64 features, including picture blurriness, brightness, etc., with image processing techniques based on GIST, Hu moments, and color histograms in OpenCV.
● Optimized hyperparameters of the SVM using Bayesian optimization in Python, achieving 77.5% accuracy on test sets.
● Built Convolutional Neural Networks (CNNs) in PyTorch using images as raw inputs, analyzed the confusion matrix, and achieved 88.3% accuracy against a 60% baseline.
SKILLS AND INTERESTS
● Programming: Python (Scikit-Learn, Matplotlib, Seaborn), R, SQL, Spark (Scala, PySpark), Hadoop, D3/React, VBA
● Analysis & Modeling: Statistical Modeling, Machine Learning, Deep Learning, Reinforcement Learning, A/B Testing
● Analysis Tools: Tableau, SPSS, EViews, Weka, Microsoft Office