Data Analyst Machine

Location:
Worcester, MA
Posted:
October 07, 2020

Resume:

Hao Yin

774-***-**** adgq36@r.postjobfree.com *** Institute Road, Worcester, MA 01609

EDUCATION

WORCESTER POLYTECHNIC INSTITUTE Worcester, MA

Master of Science in Data Science (GPA 3.81/4.0) May 2020

JINAN UNIVERSITY Guangdong, China

Bachelor of Science in Economics (Financial Engineering direction) (GPA 3.7/4.0) Jun 2018

PROFESSIONAL EXPERIENCES

CMIND INC., Data Analyst Jun 2020 - Present

US Secondary Market Fraud Detection And Portfolio Construction

● Programmed a web crawler to extract over 10 GB of daily financial data on all US public companies from Yahoo Finance

● Performed feature engineering by creating new features based on finance domain knowledge and results from NLP

● Imputed missing values with the KNN algorithm and combined downsampling with the synthetic minority oversampling technique (SMOTE) to handle a highly imbalanced dataset

● Conducted fraud detection in the secondary market using Naive Bayes, Logistic Regression, SVM, Decision Tree, Random Forest, and XGBoost models

● Optimized the Random Forest hyperparameters using random search, improving accuracy by 1.5 and AUC by 0.9 (absolute change under 10-fold cross-validation, relative to the best result without hyperparameter tuning)

● Constructed portfolios from selected fraud signals generated by ML models and measured signal performance. On average, the low-risk portfolio outperformed the S&P 500 Index by 3% per year

● Analyzed return attribution by sector, market cap, etc. with Plotly and provided reports to clients and managers

KRONOS INC., Data Scientist Co-op Jan 2020 - May 2020

Customer Behavior Pattern Recognition And Anomalous Activities Detection

● Performed exploratory data analysis on ~3 GB of workforce management data, including numerical, categorical, and time series data, using Matplotlib and Seaborn

● Preprocessed the raw dataset in Python, including feature extraction, outlier detection, and variable standardization

● Constructed a dynamic time warping (DTW) model to measure time series similarity and produced a distance matrix over 120 features

● Designed a multivariate time series clustering model (DTW with hierarchical clustering) to identify customers' usage patterns and measured clustering performance with the silhouette coefficient

● Explored the characteristics of customers in each cluster using DTW barycenter averaging

● Developed a framework to consolidate a multi-relational database of product attributes using advanced SQL queries

● Trained anomaly detection models using both unsupervised learning (isolation forest) and supervised learning (linear regression, lasso, ridge, random forest, etc.) to build an alert system for anomalous customer activity

● Built Tableau dashboards describing customers' usage activity and reported analysis results weekly

AUDIOCN INC., Data Scientist May 2019 - Sep 2019

Personalized Music Recommendation System On Imbalanced Dataset

● Extracted ~10 GB of data from the Hadoop file system and wrote a data ELT pipeline in SQL and PySpark scripts

● Constructed user profiles by generating more than 50 customer usage-pattern features over 1 million customers

● Designed Collaborative Filtering models and fine-tuned Random Forest and Gradient Boosting models using Spark MLlib, applying grid search to find hyperparameters that improved model performance and reduced overfitting

● Analyzed the confusion matrix and achieved 87.5% accuracy with the fine-tuned Random Forest model

● Collaborated with the software development team to deploy models in a new app version and to design an A/B testing framework and its key metrics. The results showed a 20 percent increase in conversion rate

● Visualized research results in Tableau and presented them in weekly meetings using PowerPoint

SELECTED PROJECTS

Deep Learning (PyTorch): Judging the Aesthetic Value of Pictures, WPI Aug 2019 - Dec 2019

● Used the AVA dataset to create training sets with 51,106 images labeled as good-looking, normal-looking, or bad-looking.

● Designed 64 features, including picture blurriness, brightness, etc., using image processing techniques based on GIST, Hu moments, and color histograms in OpenCV.

● Optimized hyper-parameters of SVM using Bayesian Optimization in Python, achieving 77.5% accuracy on test sets.

● Built convolutional neural networks (CNNs) in PyTorch using raw images as input, analyzed the confusion matrix, and achieved 88.3% accuracy against a 60% baseline.

SKILLS AND INTERESTS

● Programming: Python (scikit-learn, Matplotlib, Seaborn), R, SQL, Spark (Scala, PySpark), Hadoop, D3/React, VBA

● Analysis & Modeling: Statistical Modeling, Machine Learning, Deep Learning, Reinforcement Learning, A/B Testing

● Analysis Tools: Tableau, SPSS, EViews, Weka, Microsoft Office


