Tri Doan
e-mail: ac1m7d@r.postjobfree.com
LinkedIn: https://www.linkedin.com/in/doan-tri-64363439
DATA SCIENTIST
Qualifications
Passionate about cutting-edge technology that helps drive informed business decisions. Results-oriented, with R&D experience in computer science and a strong focus on data mining problems, particularly fraud and anomaly detection and time series analytics. Experienced with advanced machine learning algorithms and data mining toolkits in Python and R. Knowledgeable of Azure Machine Learning, H2O, and Spark. Research interests include model selection with optimal hyper-parameter search and ensemble methods with homogeneous and heterogeneous designs. Interested in learning and working in the finance sector.
Able to visualize data with matplotlib and ggplot, and familiar with Tableau, particularly for Exploratory Data Analysis (EDA) tasks. Familiar with database systems such as MySQL and MongoDB (NoSQL). Used Java and C++ for school projects and in a teaching assistant job. A dedicated mentor in the Research Experience for Undergraduates (REU) program funded by the National Science Foundation (NSF) during the last three machine learning camps organized by the University of Colorado Colorado Springs. Worked with a fellow student on the research topic: authentication of mobile devices for risk analysis.
Research interests:
Ensemble learning methods for classification and regression problems
Anomaly detection
Time series regression
Deep learning (especially RNN and LSTM models)
Recommender systems
Data analysis
Education
University of Colorado, Colorado Springs (UCCS)
Awards: Graduate scholarships (2015-2016, 2016-2017)
Student Travel Awards
Graduated May 2017
Kansas State University (KSU)
Award: Vietnam-USA scholarship (2010-2012)
Graduated May 2012
Career Experience
UCCS
Research Assistant, LINC Lab: studied solutions for various data mining problems: algorithm selection, continuous data, and the open set problem
Teaching Assistant
Java, C++, Database with SQL, Analysis of Algorithms
Research Assistant for REU (Research Experience for Undergraduates): taught data mining tools such as Azure Machine Learning, RapidMiner, and R
Fall 2013 – May 2017
Fall 2013 – May 2017
Summer 2014, 2015, 2016
Ph.D. dissertation: Ensemble Learning on Multiple Data Mining Problems, focusing on specific problems of big data (as an alternative to distributed computing frameworks such as Spark). More details are given at the end of this resume.
Recommender model
Ensemble incremental learning to deal with streaming data and limited system memory
Ensemble models for handling unknown classes in real-world applications
Improving the performance of ensemble learning with a combiner approach
These topics form the main part of my research during my Ph.D. study and are presented in the following publications.
Research
2017
“Breaking a challenge for text classification in open world recognition,” accepted paper at IEEE CCWC 2017 (Annual Computing and Communication Workshop and Conference), Jan 9–11, Las Vegas, NV.
2016
“Sentiment Analysis of Restaurant on Yelp review,” accepted short paper at IEEE ICMLA 2016 (International Conference on Machine Learning and Applications), Dec 18–20, Anaheim, CA.
“Algorithm Selection using performance and run time behavior”
Regular paper, AIMSA 2016 (International Conference on Artificial Intelligence: Methodology, Systems, and Applications), Sept 3–5, Bulgaria.
“Predicting run time of classification algorithms under a meta-learning approach,” regular paper, International Journal of Machine Learning and Cybernetics, June 2016.
2015
“Selecting Machine Learning Algorithms using Regression Models”
Workshop paper, IEEE ICDM 2015 (International Conference on Data Mining), Nov 14–17, Atlantic City, NJ.
Certificates
Course name
Date Issued
Neural Networks for Machine Learning by University of Toronto
Jan 2017
CS1156x: Learning From Data (Machine Learning) by Caltech
Dec 2016
Text Retrieval and Search Engines by University of Illinois at Urbana-Champaign
Dec 2016
Text Mining and Analytics by University of Illinois at Urbana-Champaign
Nov 2016
Machine Learning by Stanford University
Oct 2016
Cluster Analysis in Data Mining by University of Illinois at Urbana-Champaign
Oct 2016
Machine Learning: Clustering & Retrieval by University of Washington
August 2016
Data Visualization and Communication with Tableau by Duke University
June 2016
Managing Big Data with MySQL by Duke University
May 2016
Machine Learning: Classification by University of Washington
March 2016
Machine Learning: Regression by University of Washington
Dec 2015
Machine Learning Foundations: A Case Study Approach by University of Washington
Nov 2015
CS190.1x: Scalable Machine Learning by University of California, Berkeley
August 2015
Practical Machine Learning by Johns Hopkins University
August 2015
R Programming by Johns Hopkins University
June 2014
Details of the research conducted during my Ph.D. candidacy at the University of Colorado Colorado Springs are described as follows:
The goal is to find an alternative to distributed computing, using a top-down approach to address two common problems of big data: training data that is larger than memory capacity, and training data that evolves over time.
The first stage of research aims to select a data mining algorithm for a given dataset (assuming the data source is broken down into small chunks to fit the computer's memory constraints). My solution is to build a recommender system using a regression model. Original datasets (often with different numbers of features) are transformed into a common, invariant feature format using dimensionality reduction techniques. At the time of the published paper, I had explored only PCA, Kernel PCA, Factor Analysis (unsupervised), and Partial Least Squares (supervised); later I explored t-SNE (t-Distributed Stochastic Neighbor Embedding) and auto-encoders as more advanced techniques. New data features are generated from statistical summaries, with a Box-Cox transformation applied to satisfy linear regression assumptions. For the regression model, I used Multivariate Adaptive Regression Splines (MARS) and Cubist regression. Note that we do not assume the same distribution for each sampled subset, so the task of choosing an ML algorithm for each data subset benefits from this first stage. My unpublished research involves an approach for combining the outputs of all ML models across subsamples, where I suggest applying a multi-armed bandit algorithm (common in A/B testing) to combine ensemble members. Only R was used in this project. To compare multiple algorithms on multiple datasets, I applied the non-parametric Friedman test.
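The pipeline above can be sketched in a few lines. This is a minimal illustration, not my published implementation: the data, the performance scores, and the choice of PCA plus simple statistical summaries are all synthetic stand-ins, and plain linear regression stands in for MARS/Cubist (which are not in scikit-learn).

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def meta_features(X):
    """Summarize a dataset chunk into a fixed-length vector, regardless of its feature count."""
    Z = PCA(n_components=2).fit_transform(X)      # project to a common dimensionality
    feats = []
    for col in Z.T:
        shifted = col - col.min() + 1e-6          # Box-Cox requires strictly positive values
        bc, _ = stats.boxcox(shifted)             # stabilize variance for the linear model
        feats += [bc.mean(), bc.std(), stats.skew(bc)]
    return np.array(feats)

# Hypothetical meta-dataset: each chunk maps to meta-features; the target is the
# observed performance of one candidate algorithm on that chunk (made-up numbers).
chunks = [rng.normal(size=(200, d)) for d in (5, 8, 12)]   # chunks with differing feature counts
X_meta = np.vstack([meta_features(c) for c in chunks])
y_meta = np.array([0.81, 0.74, 0.69])

reg = LinearRegression().fit(X_meta, y_meta)               # stand-in for the MARS/Cubist regressor
pred = reg.predict(meta_features(chunks[0]).reshape(1, -1))
```

The key point the sketch shows is the invariance step: chunks with 5, 8, or 12 raw features all become the same 6-dimensional meta-feature vector, so one regression model can score them all.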
My second stage of research deals with big data using an incremental learning model. My solution is to build an incremental Random Forest inspired by the Mondrian splitting approach. The main work focuses on how to monitor the performance of each ensemble member to pick candidates for removal, and how to decide the split of a particular node on the fly. Python is the main programming language in this project, while R is used for statistical analysis. We compare against other current online approaches such as Online Random Forests (popular in computer vision tasks), Hoeffding Trees, and incremental Naïve Bayes as a base classifier. Only the results of the experiments on NLP (the Yelp review dataset) are published.
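The incremental setting can be illustrated with the simplest of the baselines mentioned above, incremental Naïve Bayes. This is a toy sketch on synthetic drifting data, not the Mondrian-forest model itself; it only shows the test-then-train loop over chunks that fit in memory.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
clf = GaussianNB()
classes = np.array([0, 1])

accs = []
for t in range(5):                                  # the stream arrives in memory-sized chunks
    X = rng.normal(size=(100, 4)) + 0.1 * t         # mild synthetic drift over time
    y = (X[:, 0] > X[:, 1]).astype(int)
    if t > 0:
        accs.append(clf.score(X, y))                # test-then-train: evaluate before updating
    clf.partial_fit(X, y, classes=classes)          # update the model without refitting from scratch
```

The same loop structure applies to an ensemble: each member would be scored on the incoming chunk, and persistently weak members become candidates for removal.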
My third stage of research deals with a particular problem that has given rise to a new research trend known as Open Set Recognition. The idea is that the traditional ML assumption of full knowledge of the training classes may not hold in real-world problems. This can be caused by data sources such as streaming data, where the data may evolve and present a new class over time. One example is object detection, where a class unseen during training appears at test time; another is forensic authorship attribution, where the task is to detect whether an unknown text belongs to one of the suspected authors. The theory involves the Weibull distribution and the Mahalanobis distance metric. I show that the Mahalanobis distance is more appropriate for this work, and for anomaly detection applications as well, because it is scale-invariant and unitless. My model is an incremental ensemble, but it was not developed in the previous work due to limitations of the Mondrian process in producing reliable output. Instead, I developed a modified version of Nearest Class Mean (the original version is known as the Rocchio algorithm in Information Retrieval). The main points are using a clustering approach (DBSCAN) to search for spherical class boundaries and a new strategy for creating new boundaries. Only the text classification experiments are reported in the published paper and presentation. Python is used throughout the project, including statistical analysis and ad-hoc tests. The remaining experiments include a comparison on numeric datasets and results in the offline learning case with neural network approaches (convolutional and recurrent networks).
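The core rejection idea can be sketched as a Nearest Class Mean classifier with a Mahalanobis-distance reject option. This is a minimal illustration on synthetic Gaussian classes, with a hand-picked threshold; my actual model adds the clustering-based boundary search and incremental updates described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fit: one mean and inverse covariance per known training class.
means, inv_covs = {}, {}
for label in (0, 1):
    X = rng.normal(loc=label * 4.0, size=(300, 3))
    means[label] = X.mean(axis=0)
    inv_covs[label] = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu, VI):
    """Scale-invariant, unitless distance of x from a class with mean mu and inverse covariance VI."""
    d = x - mu
    return float(np.sqrt(d @ VI @ d))

def predict_open_set(x, threshold=4.0):
    """Return the nearest class, or -1 ('unknown') if no known class is close enough."""
    dists = {c: mahalanobis(x, means[c], inv_covs[c]) for c in means}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else -1

known = predict_open_set(np.zeros(3))          # near class 0 -> accepted
unknown = predict_open_set(np.full(3, 40.0))   # far from both classes -> rejected
```

Because the distance is normalized by each class's covariance, one threshold works across classes with different scales, which is why the closed-set classifier turns into an open-set one with a single reject test.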
My research also includes meta-analysis on microarray datasets. Beyond that, I have collaborated with a Ph.D. candidate on anomaly detection for mobile devices.