
Data Scientist

Calistoga, CA
October 22, 2019

Shaohua Zhang


Washington University in St. Louis, Aug 2017 - Dec 2019
Master: Engineering Robotics & Electrical Engineering (Double major) (STEM), GPA: 3.73 / 4.0
Relevant Coursework: Artificial Intelligence, Machine Learning, Data Mining, Cloud Computing with Big Data

Xi'an Jiaotong University (Top five in China), Sep 2012 - Jul 2016
Bachelor: Electrical Information Engineering, GPA: 91 / 100

Data Science: Machine Learning, Data Preprocessing, NLP, ETL, Data Visualization (Tableau, Power BI), Deep Learning, Feature Engineering, Predictive Modeling, Data Warehouse, Hadoop (HDFS, MapReduce), Spark, Azure, Data Reporting

Statistical Analysis: Hypothesis Testing (A/B Test), Time-Series Analysis (ARIMA), Excel (pivot tables, VLOOKUP)

Programming Languages: Python (NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, Plotly, PyTorch, TensorFlow), SQL, Hive, R, Java, MATLAB

Software Engineer, Machine Learning (Intern), May 2019 - Aug 2019
KainaCapital, Shanghai, China

Used MySQL to extract data from KainaCapital's database, which holds information on 9,000+ Chinese equities, and designed signal factors for portfolio management

Created a data pipeline to perform sentiment analysis on traders' tweets collected through StockTwits
Preprocessed text by tokenizing, stemming, and removing stop words, and used TF-IDF to extract features
Trained unsupervised learning models (K-means clustering and Latent Dirichlet Allocation (LDA)) and visualized the training results through dimensionality reduction with Principal Component Analysis (PCA)
Trained and compared supervised machine learning models, including Naive Bayes Classifier, Logistic Regression, Random Forest, and Gradient Boosting, to predict traders' mood from their messages
Exposed common risk factors to the investment group, saving them 2 hours per day

Teaching Assistant for Machine Learning, Jan 2019 - May 2019
Washington University in St. Louis, St. Louis, MO

Graded assignments (8 hours per week, 87 students), class participation, and exams
Held weekly office hours and problem-solving sessions (2 hours per week) to help students develop a deep understanding of both supervised and unsupervised machine learning algorithms

Machine Learning - Heart Disease Prediction with Cleveland Clinic Dataset, Feb 2019 - Apr 2019

Developed algorithms in Python and Apache Spark to predict whether a person has heart disease
Preprocessed the dataset through data cleaning, missing value imputation, and feature engineering with Spark SQL
Used SMOTE to address the data imbalance, trained a Random Forest model and a Support Vector Machine (SVM, RBF kernel) model, and applied regularization with optimal parameters to avoid overfitting
Evaluated classification performance and improved precision to 84.5%
Compared two methods using 10 re-runs of 10-fold cross-validation and performed a statistical test (t-test) to assess whether one performs significantly better than the other

Applications Development - Online Ordering App (Android), Feb 2019 - Mar 2019

Created an online ordering app similar to Uber Eats or DoorDash
Used XML and Kotlin to develop a vivid UI and designed the icons (front end)
Built a SQLite database so customers can save orders locally, and used Firebase to let restaurant owners upload menu pictures and information (back end)
Integrated the Google Maps API to provide positioning and navigation
Designed online payment by credit card or PayPal

NLP - Neural Machine Translation with RNN (Spanish -> English), Oct 2018 - Dec 2018

Built a Seq2Seq model with multiplicative attention (PyTorch)
Used a bidirectional LSTM encoder and a unidirectional LSTM decoder (hidden size and embedding size both 256), applying dropout to the output vector with probability 0.3
Trained the model on a VM provisioned through the Microsoft Azure web portal (about 4 hours)
Evaluated the model with the corpus BLEU (Bilingual Evaluation Understudy) score and improved accuracy by 7% compared with the previous version

Artificial Intelligence - Pacman Searching Algorithm, Apr 2018 - May 2018

Implemented depth-first, breadth-first, uniform cost, and A* search algorithms in Python to solve navigation and traveling problems in the Pacman world
Optimized the search algorithms and reduced search time by 20%
Visualized Pacman's route through the maze

Chicago Crime Analysis in Apache Spark, Nov 2017 - Dec 2017

Performed spatial analysis of incident data from 2001 to 2018 reported by the Chicago Police Department
Constructed a data processing pipeline based on existing DataFrames and Spark SQL for big data OLAP
Built and visualized a model of the spatial distribution of incidents using K-means clustering
Trained and fine-tuned an ARIMA model to predict the number of future incidents
Summarized crime distribution and trends in Chicago and suggested solutions to the problems identified
