Data Computer Science

Sunnyvale, California, United States
April 20, 2017

(571) ***-**** 718 Old San Francisco Rd, Apt 255, Sunnyvale, CA, 94086

EDUCATION

University of Illinois at Urbana-Champaign, Illinois, USA Aug 2015 – May 2017

Master of Science (M.S.) in Statistics

• Current GPA: 3.91 / 4.00

• Relevant Coursework: Statistical Learning (R), Database Systems (MySQL, Java), Data Mining (Python)

Communication University of China, Beijing, China Aug 2011 – Jul 2015

Bachelor of Science (B.S.) in Statistics

SKILLS

Proficient in R, Python, SQL, SAS. Familiar with Hadoop, Tableau, C++, Java, MATLAB.

ACADEMIC EXPERIENCE

Wikipedia PageRank with Hadoop Mar 2017 – Present

• Used Hadoop to build an inverted index over Wikipedia dumps and compute PageRank scores

• Developed a search program that takes user keywords and returns the highest-ranked Wikipedia links

MovIen: Database-Driven Web Application Oct 2016 – Dec 2016

• Processed the 17 GB IMDb movie database and designed a relational database schema in SQL

• Built the server in Flask to provide basic functionality, using SQLAlchemy to connect to the database

• Used Python and JavaScript to build an interactive visualization explorer of the movie market

Topic Modeling and Frequent Pattern Mining Oct 2016 – Nov 2016

• Preprocessed 30,796 paper titles from five domains of computer science by stemming and tokenization

• Ran Latent Dirichlet Allocation (LDA) to assign a topic to each term and regrouped the terms by topic

• Implemented Apriori in Python and mined frequent, closed, and maximal patterns for each topic

• Re-ranked the frequent patterns by purity to find patterns that are frequent only under a particular topic

Spam Classification Based on Email Corpus Jul 2016 – Aug 2016

• Preprocessed 4,000 class-imbalanced emails with Porter stemming and extracted word frequencies as predictors

• Built a linear SVM model in R with 98% test accuracy, addressing the class imbalance by evaluating with AUC

• Built a logistic regression model with an elastic net penalty in R, addressing the class imbalance by cross-validating the classification threshold

Feedback Prediction for Blog Post Oct 2016 – Dec 2016

• Data preprocessing: identified extreme skewness and multicollinearity in the training data set

• Built two tree-based models, Random Forest and AdaBoost, using the Python libraries scikit-learn and pandas

• Built Group Lasso and Principal Component Regression (PCR) models in R to handle the high-dimensional sparse matrix, with a classification step first to remedy the zero inflation

SimpleDB: A Basic Database Management System Oct 2016 – Dec 2016

• Used Java to extend the functionalities of a simple database management system

• Implemented a collection of operators (GROUP BY, ORDER BY, and aggregates such as AVG, SUM, MAX)

• Implemented a B-tree index and selectivity estimation for query optimization, speeding up query execution

WORK EXPERIENCE

Thomson Reuters Corporation, Beijing Bureau, Research Intern Jan 2015 – May 2015

• Collected data and cleaned a historical database spanning more than 10 years for the financial analysis product Eikon

• Wrote research reports to support news coverage, covering identification of data needs, data acquisition, analysis, and visualization; provided data and results before journalists knew they needed them

• Participated in the oil refinery poll, summarizing the overall picture of China's crude oil production

• Analyzed corporate annual reports to predict production trends from historical data

Survey Statistics Institute (SSI), Data Analyst Intern Jul 2014 – Aug 2014

• Case 1: Segmented 200,000 cable users into groups with K-means clustering; visualized user profiles in Tableau to help networks tailor programming and attract a larger audience

• Case 2: Used Python to access the Twitter API, fetched the text of 48,000 tweets, and estimated their sentiment; estimated public perception of a particular term that was not in the vocabulary list