Data Analyst

Location:

Milpitas, CA, 95035

Posted:

July 24, 2017


Resume:

Weiqing Yu

********@*****.*** 513-***-****

Objective: Enthusiastic and dedicated data scientist with solid data analysis skills and extensive research and work experience, seeking Data Analyst and Data Scientist positions.

Education

University of Michigan, Ann Arbor, Department of Statistics Sept. 2014-Apr. 2016
Masters, Applied Statistics
Relevant Coursework: Machine Learning, Linear Regression, Categorical Data Analysis, Statistical Computing, Data Structure and Algorithm, Statistical Consulting
GPA: 3.80

University of Cincinnati, Department of Mathematical Sciences Sept. 2012-Sept. 2014
Bachelor of Science, Mathematics
Major GPA: 4.00
Honors: Two years on the Dean’s List, Charles Phelps Taft Senior Thesis Fellowship

Capital Normal University, Department of Mathematical Sciences Sept. 2010-Sept. 2012
Bachelor of Science, Mathematics and Applied Mathematics
Major Average: 91.2/100

Work Experience

Global Energy Interconnection Research Institute North America

Data Scientist May 2016-Present

Project: Missing Data Recovery by Matrix Completion for Smart Meter Data (Python, R, Spark, Scala)
• Built a distributed missing-data recovery package for smart meter data in Spark that implements the Alternating Least Squares (ALS) algorithm to fill in missing data.
• Developed the SAS (Smoothed ALS) algorithm by adding a smoothness regularization term to ALS, which improved the prediction accuracy and stability of ALS in real-world smart meter data recovery problems. GitHub page: https://github.com/WeiqingYu/SAS.
• Based on k-NN imputation, developed the Cluster-based Best Match Scanning (CBMS) algorithm, which reduces the time complexity of k-NN imputation from O(n^2 d) to O(n^1.5 d) and significantly improved accuracy in experiments. Related Paper: Yu, W. & Zhu, W. (2017, June). Cluster-based Best Match Scanning for Large-Scale Missing Data Imputation. Paper presented at the 3rd Meeting of International Conference on Big Data Computing and Communication, Chengdu, China.
• Constructed an ensemble model with SAS, CBMS and a Kalman filter-based data recovery algorithm that exceeds the accuracy of most commonly used data recovery algorithms, such as linear interpolation and random forest imputation.
• Built a Python software package for the ensemble data recovery model.

Project: Smart Grid Big Data Analysis System (PostgreSQL, Spark, Scala, R, Tableau)
• Performed data ETL for a database with over 2 billion records in PostgreSQL, and migrated an entire database of over 300 GB from Oracle Database to PostgreSQL.

• Built a user interface with JavaScript and R that performed k-means, k-subspace and k-shape clustering with user-selected parameters and presented the clustering results in Tableau.
• Developed a streaming data processing model in Spark that clustered real-time daily power system data with an original online clustering algorithm and stored the clustered data in a PostgreSQL database. Related Paper: Zhu, W. & Yu, W. (2017, June). Smart Meter Data Analytics based on Modified Streaming k-Means. Paper presented at the 3rd Meeting of International Conference on Big Data Computing and Communication, Chengdu, China.

University of Michigan, Ann Arbor, EECS Department

Research Assistant (R, Matlab) May 2015-Aug. 2016

• Imported and pre-processed data from disordered text files in Matlab and built a multinomial regression model in R to identify the origin of arrhythmia from ECG signals.
• Improved the accuracy of the predecessor’s model by 10 percentage points (75% to 85%) through feature engineering and by adding a LASSO regularization term to the model.
• Clearly presented and explained the results of the classification model to faculty with little mathematics background, and visualized the data using Principal Component Analysis (PCA), coefficient plots and other techniques.

IBM Global Business Solution Center

Data Analyst Intern (SPSS Modeler, R) Jun. 2014-Aug. 2014
• Worked with other data analysts to construct an ARIMA time series model for oil import and export in SPSS, which helped prevent oil depots from being overstocked or understocked.
• Devised and presented a stochastic model for pipeline condition data that predicts the growth of pitting corrosion and provides early warnings of potential pipeline rupture.

Academic Projects

Kaggle Competition: BNP Paribas Cardif Claims Management (Python, R) Feb. 2016-Apr. 2016
• Implemented multiple imputation to fill in missing data and used Principal Component Analysis (PCA) and Restricted Boltzmann Machines to reduce the dimensionality of the data.
• Built an ensemble model with Extreme Gradient Boosting, Artificial Neural Network and Random Forest classifiers to predict the probability that a claim is approved; the predictions ranked in the top 10%.

R Package for a High Dimensional Bayesian Method (C++, R) May 2015-Feb. 2016
• Used a Moreau-Yosida approximation scheme for the posterior distribution in a high dimensional Bayesian problem, which can be applied to high dimensional Bayesian regression and high dimensional precision matrix computation.
• Developed an R package for the algorithm in C++ that supports parallel computing.

R Package for Elastic-net Model (C++, R, Fortran) Oct. 2015-Dec. 2015
• Built an R package “Cglm” in C++, based on the Fortran source code of the R package “glmnet”, which efficiently fits the elastic-net regularization path for generalized linear models.

Computer Skills

• Statistical Software: R, Matlab, SPSS Modeler, SAS and Mathematica.
• Programming Languages: Python, C++, Scala and Fortran.
• Databases: PostgreSQL, Oracle Database.
• Big Data Processing Frameworks: Spark, Hadoop.
• Operating Systems: Linux, Microsoft Windows and macOS.
