Data Scientist

Location:

Kingston, ON, Canada

Posted:

May 22, 2018

Resume:

Zhang, Ru

*** ****** ********

Kingston, ON, K*L*T8

*************@*****.***

613-***-****

To Hiring Manager,

I am interested in applying for the position of data scientist at Incorporan Inc. Montreal. The enclosed resume lists my relevant work experience, selected projects and research publications. I would like to work with statistical models and machine learning algorithms for challenging industrial applications. If you have a moment, I would be delighted to ask you some questions about the types of projects your institution works on.

Please feel free to contact me by email or telephone if you have any questions about my resume. I am also available for a face-to-face interview in Montreal. I look forward to your response.

Sincerely,

Ru Zhang

Zhang, Ru

Email: *************@*****.***

Mobile: 613-***-****

Address: 115 Wright Crescent, Kingston, ON, Canada

Education

Queen’s University Kingston, ON

Ph.D. in Statistics; GPA: 4.17/4.30 Sep. 2014 - Present

Nankai University Tianjin, China

M.S. in Statistics Sep. 2010 - June 2013

Dongbei University of Finance and Economics Dalian, China

B.S. in Applied Mathematics Sep. 2006 - June 2010

Experience

Queen’s University Kingston, ON

Teaching & Research Assistant Sep. 2014 - Present

Research Assistant: Ph.D. research in statistics focusing on modeling, analysis and the inverse problem of dynamic computer experiments with large-scale data sets.

Teaching Assistant: Tutorials, assignment marking and term marking for statistical and mathematical courses.

Annoroad Gene Technology Beijing, China

Data Scientist Mar. 2014 - Sep. 2014

R & D: Machine learning algorithms for preimplantation genetic screening for Down syndrome.

China First Heavy Industries Tianjin, China

Software Engineer June 2013 - Mar. 2014

C/C++ development: numerical libraries for solving differential equations in the process control system of hot rolling machinery.

Projects

Sequential Design for the Inverse Problem in Dynamic Computer Experiments Sep. 2017 - Feb. 2018

Description: a new sequential design algorithm and expected improvement criterion for solving inverse problems in dynamic computer experiments, which achieves significantly higher accuracy than existing alternatives. The convergence of this algorithm is proved using the theory of reproducing kernel Hilbert spaces (RKHS).

Keywords: expected improvement, sequential design, RKHS

Languages: R and C

GitHub Repo: https://github.com/heavenmarshal/l2inv

Corporación Favorita Grocery Sales Forecasting Dec. 2017 - Jan. 2018

Description: Kaggle challenge of forecasting the sales of 4000+ grocery items in 54 chain stores from Aug. 16, 2017 to Aug. 31, 2017. The submitted solution consists of 80+ handcrafted features including moving averages, periodic means and standard deviations, and promotion and holiday information. An ensemble of gradient boosting models built with LightGBM and xgboost is applied.

Result: silver medal (top 3%, ranks 43 of 1675 teams)

Keywords: machine learning, sales forecasting, time series

Languages and Tools: R, C, data.table, dplyr, LightGBM, SQLite and xgboost

Local Approximate SVD-based Gaussian Process (GP) Models Mar. 2016 - Aug. 2017

Description: a new algorithm for selecting neighborhood sets and fitting local SVD-based GP models to address the infeasibility of the classic GP on large-scale dynamic experiments. The algorithm is parallelized via the R package "parallel" and OpenMP. A stable version named "DynamicGP" has been published on CRAN.

Keywords: Gaussian processes, nearest neighborhood, parallel programming

Languages: R, C and C++

GitHub Repo: https://github.com/heavenmarshal/lasvdgp

Quora Question Pairs May 2017 - Jun. 2017

Description: Kaggle challenge of identifying identical questions posted on Quora in order to merge similar entries. The training set consists of 400K+ question pairs, with tags 1 (identical) or 0 (different). The submitted solution uses 40+ handcrafted features plus 200 automatically selected bag-of-words features. Classification is made by gradient boosting models (xgboost).

Result: bronze medal (top 6%, ranks 168 of 3307 teams)

Keywords: classification, natural language processing

Language and Tools: Python, NLTK, scikit-learn, Stanford CoreNLP and xgboost

Field-aware Factorization Model Training on Hadoop HDFS May 2017

Description: adapted the libffm data I/O interface to read training data from and write predictions to Hadoop HDFS directly, accelerating the model training workflow.

Language and Tools: C, Hadoop and HDFS

Research Items

Zhang, R., Lin, C. D., Ranjan, P. (2018) "Local Gaussian Process Model for Large-scale Dynamic Computer Experiments", accepted by Journal of Computational and Graphical Statistics.

Zhang, R., Lin, C. D., Ranjan, P. (2018) "A Sequential Design Approach for the Inverse Problem in Dynamic Computer Experiments", manuscript.

Skills

Programming Languages: R, C/C++, Python

Operating System: Linux

Machine Learning Libraries: xgboost, LightGBM, NLTK, scikit-learn

Numerical Libraries: BLAS, LAPACK, Eigen

Data Wrangling: data.table, dplyr, pandas, SQLite, MySQL

Text Editing: Emacs, LaTeX

Parallel Programming: OpenMP, R package ’parallel’

Version Control: git

Links

My GitHub: https://github.com/heavenmarshal

My Kaggle Profile: https://www.kaggle.com/heavenmarshal
