Data Analyst

Washington, District of Columbia, United States
January 27, 2018

Cell: 202-***-**** -- E-mail:

**** ******** ***, ** ********** D.C.


50% Data Analyst/Data Scientist/Data Engineer & 50% Software Engineer/Machine Learning Engineer

TECHNICAL SKILLS and APPLICATIONS

• Python

• R/RStudio

• Java

• C++/C#


• Hadoop/MapReduce/Spark

• TensorFlow

• Tableau




Master of Science: Data Science, GPA: 3.9/4.0, 2015 - May 2017
George Washington University – Washington D.C., USA
Relevant courses: Machine Learning; Data Analysis; HPC and Parallel Computing; Applied Linear Models; Visualization of Complex Data

Bachelor of Engineering: Software Engineering, GPA: 3.32/4.0, 2011-2015
Chongqing University of Posts and Telecommunications – Chongqing, China
Relevant courses: Process-oriented Programming; Object-oriented Programming; Database Design and Management; Data Structure; .NET Platform and Application Development


• User Churn Prediction: Classification, Prediction, Machine Learning

o Predicted which customers are likely to stop using the service, using KNN, Logistic Regression, and Random Forest models

o Cleaned and sorted data with the Pandas and NumPy libraries; Standardized the X matrix and trained models using 5-fold cross-validation, comparing confusion matrices to evaluate the models with scikit-learn; Used grid search to find optimal parameters; Performed feature selection for the logistic model with L1 and L2 penalties using recursive feature elimination (RFE)

• Movie Synopses Clustering: Classification, Natural Language Processing

o Divided 100 movies into groups according to synopses combined from IMDb and wiki, using the KMeans method and Latent Dirichlet Allocation (LDA)

o Tokenized and stemmed the synopses to build a dictionary mapping words to their inflected forms, excluding some common words; Used TF-IDF to select the most effective words and to build a TF-IDF weighting matrix; Ran both the KMeans and LDA methods to cluster the movies; For the KMeans method, reduced all points to 2 dimensions using PCA and visualized the distribution

• Number Recognition: Neural Network

o Recognized the letters “D”, “J”, “C”, and “M” corrupted by noise with a Hopfield network

o Built a Hopfield neural network in Python that learned 5x5 pixel matrices of the four letters; Trained the weights on the correct letters to derive the weight matrix; Tested the network with new inputs corrupted by noise

• Car Price Prediction: Data Extraction and Statistical Model Building

o Built a model to predict the prices of Ford cars using data extracted from a website

o Extracted data from the website with the Python libraries requests and BeautifulSoup; Cleaned data with Excel; Performed variable selection, removed outliers, and fit several models in R; Selected the model with the smallest cross-validation error

• Car Club Membership Management System: Management System Development

o Built a website providing a management system for a car club; Customers could order services and take part in activities on the website; The car club could process all orders and post activities

o Used C# and SQL Server 2008 R2 to build the website

• Analysis of Energy Use and CO2 Emission: Data Mining

o Obtained energy use and CO2 emission data for 33 countries in 2006, 2007, and 2008 from the World Bank website; Determined the relationship between energy use and CO2 emission; Assessed the effect of terrestrial projected areas on energy use and CO2 emission

o Used the Pandas library in Python to analyze the data and produce result tables; Created visualizations with Tableau and D3; Presented the relationships between energy use, CO2 emission, and terrestrial projected areas with line charts and dynamic charts

• Pancreatic Cancer Project: Survival Analysis

o Compared two different treatments by analyzing the biopsy records of patients receiving those treatments

o Used the regression imputation method to handle missing values and used R for the survival analysis; Estimated the survival rates for the two treatments and showed them in a plot

CERTIFIED

• Primary Microsoft SDE/SDET Certified

• Google Cloud Platform Fundamentals (CP1000A) Completion Certified
