Data Scientist

Location: Boston, MA

Salary: 80k

Posted: November 05, 2019


Resume:

Siyu Xiao

Position Objective: Data Scientist & Analyst

Malden, MA, 02148 ****.**@*****.***.*** 781-***-****

EDUCATION

Northeastern University, Boston, MA (GPA: 3.64) Sep.2017 - Jul.2019

Master of Analytics – Statistical Modeling (STEM)

Courses: Machine Learning, Probability Theory and Statistics, Data Mining, Data Visualization, Big Data, Data Warehouse

Harbin University of Science and Technology, China Aug.2012 - Jun.2016

Bachelor of Engineering – Measure and Control Technology

Courses: Complex Function, Computer Principle, Linear Algebra, Calculus, Digital Image Processing, Algorithm Introduction

TECHNICAL SKILLS

Software Tools: Python, R, SQL, MATLAB, C, Lisp, Tableau, Excel, SPSS, Spark, RapidMiner, Linux, Neo4j, Orange

Framework and Platform: PyTorch, TensorFlow, Keras, Scikit-learn, AWS, Azure, Cloudera, Databricks

EXPERIENCE

VitaData.io Cambridge, MA

Data Scientist Oct.2018 - Mar.2019

• Scraped exchange records for cryptocurrencies (such as Bitcoin) with the Beautiful Soup and Requests libraries and converted them to CSV. Combined them with other confidential data and calculated the correlation coefficients among the variables in Excel. Predicted cryptocurrency prices with Regression Tree and Linear Regression models in Spark MLlib, with accuracy around 75%.

• Analyzed comment sentiment in the cryptocurrency market. Used pickle to serialize data in binary form, applied jieba to segment Chinese text, calculated TF-IDF to select important words, and applied NLTK to tokenize English text and look up synonyms.

• Used a GRU (Gated Recurrent Unit) to predict sentiment with about 90% accuracy. Loaded the analyzed results into a Neo4j database over its RESTful API, using cURL or py2neo. Built an R Shiny app showing the influence of sentiment on cryptocurrency.

• Used KNN and Decision Tree classifiers to detect breast cancer more efficiently, applying z-score standardization or normalization to the features before finding nearest neighbors. Across the k values tested, the highest diagnostic Recall and Precision were around 95% and 92%, respectively.

• Predicted power consumption based on hospitals' real-time conditions. Mined the data, for example by ranking power-consuming activities. Used an RNN in Keras on TensorFlow to predict energy usage and applied Dropout to regularize the model. Used Seaborn to plot predicted and actual energy consumption as line charts in different colors. The prediction accuracy was about 83%.

Information Shield LLC Beijing, China

Data Analyst Sep.2016 - Aug.2017

• Explored and visualized the characteristics of five million randomly sampled rows of training data with Tableau. Used NumPy and Pandas to preprocess the data and applied Logistic Regression to detect whether a network connection is dangerous.

• Used threshold moving to address dataset imbalance, moving the decision threshold to the class-prior probability, and used Bootstrap Aggregating (Bagging) to improve the stability of the models.

• Used Microsoft Azure ML Studio to train on all 50 million rows of data with Logistic Regression plus Random Forest, plotted the ROC curve, and calculated an AUC of 0.9, an improvement of 0.15 over Logistic Regression alone.

PROJECTS

GE Aviation collaborative project: data mining and prediction on security monitoring data Apr.2019 - Jul.2019

• Merged all the datasets into one by primary key to make analysis easier. Handled outliers, missing values, and irrelevant data.

• Drew Entity-Relationship Diagrams, such as a Snowflake Schema in Normal Form, in MySQL Workbench, performed Exploratory Data Analysis on the merged dataset, and visualized the ranking of indicators that trigger security alerts with Matplotlib.

• Embedded and normalized the string data into a structure suitable as input to a machine learning model. Used Supervised Learning algorithms with Scikit-learn, including XGBoost, SVM and Logistic Regression, to classify the dataset, with the highest Recall about 96% and Precision about 95%, which helped GE identify employees with malicious intent.

Predicted loan applicants' repayment ability for Citizens Bank Jun.2018 - Sep.2018

• Applied the Entity Embedding method to automatically learn representations of structured data in a multi-dimensional space, which is better than one-hot encoding or manually designed, experience-based features.

• Using the PyTorch library, applied a Recurrent Neural Network layer (LSTM) to process time-series data and a dense layer to process fixed-length data, merged the dense layer and the LSTM hidden layer after feature extraction, then used two dense layers to produce the output layer, and calculated loss with Binary Cross-Entropy (BCELoss).

• Learned the parameters of all the networks with Stochastic Gradient Descent. Used cross-validation to improve the accuracy of predicting whether an applicant can repay the loan to about 80%, and visualized all the results with Plotly.

Recommended and filtered ads for Viacom Inc. Jan.2018 - Apr.2018

• Used the K-means algorithm to cluster market segments in Viacom social channel data and targeted ads to users with shared interests, which improved ad effectiveness by 11%. Visualized the clustering data and ad effectiveness with ggplot2.

• Split the original text into words and visualized them with the wordcloud package. Applied the Naïve Bayes algorithm to filter ads, with Precision about 92% and Sensitivity about 93%.

• Used SageMaker on AWS to train the same models on the data, which produced the same results in half the running time.
