Tri Doan
e-mail: ac1m7d@r.postjobfree.com
LinkedIn: https://www.linkedin.com/in/doan-tri-64363439
DATA SCIENTIST
Qualifications
Passionate about cutting-edge technology that helps drive informed business decisions. Results-oriented, with R&D experience in computer science and a strong focus on data mining problems, particularly fraud and anomaly detection and time series analytics. Experienced with advanced machine learning algorithms and data mining toolkits in Python and R. Knowledgeable of Azure Machine Learning, H2O, and Spark. Research interests include model selection with optimal hyper-parameter search and ensemble methods with homogeneous and heterogeneous designs. Interested in learning and working in the finance sector.
Able to visualize data with matplotlib and ggplot, and familiar with Tableau, particularly for Exploratory Data Analysis (EDA) tasks. Familiar with database systems such as MySQL and MongoDB (NoSQL). Used Java and C++ for school projects and in a teaching assistant job. A dedicated mentor in the Research Experience for Undergraduates (REU) program funded by the National Science Foundation (NSF) during the last three machine learning camps organized by the University of Colorado Colorado Springs. Worked with a fellow student on the research topic: authentication of mobile devices for risk analysis.
Research interests:
Ensemble learning methods for classification and regression problems
Anomaly detection
Time series regression
Deep learning (especially RNN and LSTM models)
Recommender systems
Data analysis
Education
University of Colorado, Colorado Springs (UCCS)
Awards: Graduate scholarships (2015-2016, 2016-2017)
Student Travel Awards
Graduated May 2017
Kansas State University (KSU)
Award: Vietnam-USA scholarship (2010-2012)
Graduated May 2012
Career Experience
UCCS
Research Assistant, LINC Lab: studied solutions for various data mining problems: algorithm selection, continuous data, and the open set problem
Teaching Assistant
Java, C++, Database with SQL, Analysis of Algorithms
Research Assistant for REU (Research Experience for Undergraduates): taught data mining tools such as Azure Machine Learning, RapidMiner, and R
Fall 2013 – May 2017
Fall 2013 – May 2017
Summer 2014, 2015, 2016
Ph.D. dissertation: Ensemble Learning on Multiple Data Mining Problems, focusing on specific problems of big data (as an alternative to distributed computing frameworks such as Spark). More details are given at the end of this resume.
Recommender model
Ensemble incremental learning to deal with streaming data and limited system memory
Ensemble models for handling unknown classes in real-world applications
Improving the performance of ensemble learning with a combiner approach
These topics form the main part of my research during my Ph.D. study and are presented in the following publications.
Research
2017
“Breaking a challenge for text classification in open world recognition,” accepted paper at IEEE CCWC 2017 (Annual Computing and Communication Workshop and Conference), Jan 9–11, Las Vegas, NV.
2016
“Sentiment Analysis of Restaurant on Yelp review,” accepted short paper at IEEE ICMLA 2016 (International Conference on Machine Learning and Applications), Dec 18–20, Anaheim, CA.
“Algorithm Selection using performance and run time behavior”
Regular paper, AIMSA 2016 (International Conference on Artificial Intelligence: Methodology, Systems, and Applications), Sept 3–5, Bulgaria.
“Predicting run time of classification algorithms under a meta-learning approach,” regular paper, International Journal of Machine Learning and Cybernetics, June 2016.
2015
“Selecting Machine Learning Algorithms using Regression Models”
Workshop paper, IEEE ICDM 2015 (International Conference on Data Mining), Nov 14–17, Atlantic City, NJ.
Certificates
Course name
Date Issued
Neural Networks for Machine Learning by University of Toronto
Jan 2017
CS1156x: Learning From Data (Machine Learning) by Caltech
Dec 2016
Text Retrieval and Search Engines by University of Illinois at Urbana-Champaign
Dec 2016
Text Mining and Analytics by University of Illinois at Urbana-Champaign
Nov 2016
Machine Learning by Stanford University
Oct 2016
Cluster Analysis in Data Mining by University of Illinois at Urbana-Champaign
Oct 2016
Machine Learning: Clustering & Retrieval by University of Washington
August 2016
Data Visualization and Communication with Tableau by Duke University
June 2016
Managing Big Data with MySQL by Duke University
May 2016
Machine Learning: Classification by University of Washington
March 2016
Machine Learning: Regression by University of Washington
Dec 2015
Machine Learning Foundations: A Case Study Approach by University of Washington
Nov 2015
CS190.1x: Scalable Machine Learning by University of California, Berkeley
August 2015
Practical Machine Learning by Johns Hopkins University
August 2015
R Programming by Johns Hopkins University
June 2014
Details of the research conducted during my Ph.D. candidacy at the University of Colorado Colorado Springs are described as follows:
The goal is to find an alternative to distributed computing, using a top-down approach to address two common problems of big data: training data that is larger than memory capacity, and training data that evolves over time.
The first stage of research aims to select a data mining algorithm for a given dataset (assuming the data source is broken down into small chunks to fit the computer's memory constraints). My solution is to build a recommender system using a regression model. Original datasets (often with different numbers of features) are transformed into a common, invariant feature format using dimensionality reduction techniques. At the time of the published paper, I had explored only PCA, Kernel PCA, Factor Analysis (unsupervised), and Partial Least Squares (supervised); later I explored t-SNE (t-Distributed Stochastic Neighbor Embedding) and auto-encoders as more advanced techniques. New data features are generated from statistical summaries, with a Box-Cox transformation applied to satisfy linear regression assumptions. For the regression model, I used Multivariate Adaptive Regression Splines (MARS) and Cubist regression. Note that we do not assume the same distribution for each sampled subset, so the task of choosing an ML algorithm for each data subset benefits from this first stage. My unpublished research involves an approach for combining the outputs of all ML models across subsamples, where I suggest applying a multi-armed bandit algorithm (common in A/B testing) to combine ensemble members. Only R was used in this project. To compare multiple algorithms on multiple datasets, I applied the non-parametric Friedman test.
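The pipeline above can be sketched in a few lines. This is a minimal illustration, not my published implementation: the data, the performance scores, and the choice of PCA plus simple statistical summaries are all synthetic stand-ins, and plain linear regression stands in for MARS/Cubist (which are not in scikit-learn).

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def meta_features(X):
    """Summarize a dataset chunk into a fixed-length vector, regardless of its feature count."""
    Z = PCA(n_components=2).fit_transform(X)      # project to a common dimensionality
    feats = []
    for col in Z.T:
        shifted = col - col.min() + 1e-6          # Box-Cox requires strictly positive values
        bc, _ = stats.boxcox(shifted)             # stabilize variance for the linear model
        feats += [bc.mean(), bc.std(), stats.skew(bc)]
    return np.array(feats)

# Hypothetical meta-dataset: each chunk maps to meta-features; the target is the
# observed performance of one candidate algorithm on that chunk (made-up numbers).
chunks = [rng.normal(size=(200, d)) for d in (5, 8, 12)]   # chunks with differing feature counts
X_meta = np.vstack([meta_features(c) for c in chunks])
y_meta = np.array([0.81, 0.74, 0.69])

reg = LinearRegression().fit(X_meta, y_meta)               # stand-in for the MARS/Cubist regressor
pred = reg.predict(meta_features(chunks[0]).reshape(1, -1))
```

The key point the sketch shows is the invariance step: chunks with 5, 8, or 12 raw features all become the same 6-dimensional meta-feature vector, so one regression model can score them all.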
My second stage of research deals with big data using an incremental learning model. My solution is to build an incremental Random Forest inspired by the Mondrian splitting approach. The main work focuses on how to monitor the performance of each ensemble member to pick candidates for removal, and how to decide the split of a particular node on the fly. Python is the main programming language in this project, while R is used for statistical analysis. We compare against other current online approaches such as Online Random Forests (popular in computer vision tasks), Hoeffding Trees, and incremental Naïve Bayes as a base classifier. Only the results of the experiments on NLP (the Yelp review dataset) are published.
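The incremental setting can be illustrated with the simplest of the baselines mentioned above, incremental Naïve Bayes. This is a toy sketch on synthetic drifting data, not the Mondrian-forest model itself; it only shows the test-then-train loop over chunks that fit in memory.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
clf = GaussianNB()
classes = np.array([0, 1])

accs = []
for t in range(5):                                  # the stream arrives in memory-sized chunks
    X = rng.normal(size=(100, 4)) + 0.1 * t         # mild synthetic drift over time
    y = (X[:, 0] > X[:, 1]).astype(int)
    if t > 0:
        accs.append(clf.score(X, y))                # test-then-train: evaluate before updating
    clf.partial_fit(X, y, classes=classes)          # update the model without refitting from scratch
```

The same loop structure applies to an ensemble: each member would be scored on the incoming chunk, and persistently weak members become candidates for removal.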
My third stage of research deals with a particular problem that has given rise to a new research trend known as Open Set Recognition. The idea is that the traditional ML assumption of full knowledge of the training classes may not hold in real-world problems. This can be caused by data sources such as streaming data, where the data may evolve and present a new class over time. One example is object detection, where a class unseen during training appears at test time; another is forensic authorship attribution, where the task is to detect whether an unknown text belongs to one of the suspected authors. The theory involves the Weibull distribution and the Mahalanobis distance metric. I show that the Mahalanobis distance is more appropriate for this work, and for anomaly detection applications as well, because it is scale-invariant and unitless. My model is an incremental ensemble, but it was not developed in the previous work due to limitations of the Mondrian process in producing reliable output. Instead, I developed a modified version of Nearest Class Mean (the original version is known as the Rocchio algorithm in Information Retrieval). The main points are using a clustering approach (DBSCAN) to search for spherical class boundaries and a new strategy for creating new boundaries. Only the text classification experiments are reported in the published paper and presentation. Python is used throughout the project, including statistical analysis and ad-hoc tests. The remaining experiments include a comparison on numeric datasets and results in the offline learning case with neural network approaches (convolutional and recurrent networks).
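The core rejection idea can be sketched as a Nearest Class Mean classifier with a Mahalanobis-distance reject option. This is a minimal illustration on synthetic Gaussian classes, with a hand-picked threshold; my actual model adds the clustering-based boundary search and incremental updates described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fit: one mean and inverse covariance per known training class.
means, inv_covs = {}, {}
for label in (0, 1):
    X = rng.normal(loc=label * 4.0, size=(300, 3))
    means[label] = X.mean(axis=0)
    inv_covs[label] = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu, VI):
    """Scale-invariant, unitless distance of x from a class with mean mu and inverse covariance VI."""
    d = x - mu
    return float(np.sqrt(d @ VI @ d))

def predict_open_set(x, threshold=4.0):
    """Return the nearest class, or -1 ('unknown') if no known class is close enough."""
    dists = {c: mahalanobis(x, means[c], inv_covs[c]) for c in means}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else -1

known = predict_open_set(np.zeros(3))          # near class 0 -> accepted
unknown = predict_open_set(np.full(3, 40.0))   # far from both classes -> rejected
```

Because the distance is normalized by each class's covariance, one threshold works across classes with different scales, which is why the closed-set classifier turns into an open-set one with a single reject test.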
My research also includes meta-analysis on microarray datasets. Beyond that, I have collaborated with a Ph.D. candidate on anomaly detection for mobile devices.