Sun Xie
https://www.linkedin.com/in/sun-xie-*****2a6/ *****@*******.*** 765-***-****
EDUCATION
Cornell University, Ithaca, NY Dec 2017 (Expected)
Master of Professional Studies in Applied Statistics Overall GPA: 3.70/4.00
Purdue University, West Lafayette, IN Aug 2016
Double major in Mathematics and Statistics Minor: Management Overall GPA: 3.78/4.00 (Dean’s List and Semester Honors)
COMPUTER SKILLS
R, Microsoft Excel, SQL (Oracle, MySQL), Python (numpy, scipy, sklearn), Hadoop (MapReduce, Hive, Pig), SAS (Certified Base Programmer for SAS 9), Machine Learning, Web Crawling, Tableau, LaTeX, Java, Linux, Statistical Modeling
PROFESSIONAL EXPERIENCES
Data Scientist Engineer, Part Time STOWK Berkeley, CA Oct 2017 – Present
Scraped 600,000 records, including AIS tracking system locations, ports, and vessels, from dynamic tables on marine websites via API using Python, and stored them in a remote MongoDB database
Visualized a world map showing the name, location, and pictures of each port through interactive plots built with the Bokeh package.
Data Analyst Assistant II Cornell University Department of Human and Ecology Aug 2017 – Present
Draw insights from SAS statistical outputs, summarize results from experimental designs, and generate reports containing tables and charts that interpret the results against the experimental objectives
Build within-group and between-group statistical comparison graphics for each predictor in Tableau and Publisher, and present them to linguists once per week.
Student Data Analyst The Raymond Corporation Greece, NY Aug 2016 – May 2017
Collected seasonal data from customer/partner companies’ financial statements and monthly stock returns, cleansed the data using R, transformed it to a monthly basis for analysis, and used R packages to filter possible predictors
Drew autocorrelation graphs between each variable and the total industry orders in each class to demonstrate insignificance
Used panel data models (random effects, fixed effects, OLS) to predict total industry orders, and used SUR (seemingly unrelated regression) to forecast each individual customer company’s orders based on the total industry results
Hosted bi-weekly meetings with clients to report progress and gather feedback, and managed the project by closely monitoring status and milestones to make sure clients’ expectations were fully met
Business Analyst SAP Singapore Summer 2015
Performed due diligence on multiple companies’ profiles, including revenue, operational costs, sunk costs, and future plans
Presented findings and conclusions of the annual results, and provided strategies and recommendations to the clients
Analyzed and predicted risks for target companies related to SAP’s products, including CRM, SCM, and cloud
ACADEMIC PROJECTS
In-Class Kaggle Competition (Handwriting Distinction) Cornell University Fall 2017
Used the spectral clustering algorithm to compute the adjacency matrix and a matrix of the first 40 eigenvectors from the graph dataset of the first 6,000 training points.
Used Canonical Correlation Analysis to combine the 6,000 training points with the 40-eigenvector matrix, obtaining a 28-column eigenmatrix that was then applied to the training set.
Obtained 10 initial clusters with the K-means algorithm and labeled them by majority vote over the 60 labeled points.
Applied supervised learning (kernel SVM, random forest, neural network, KNN) to predict labels for the final 4,000 test points based on the training set.
Data Analysis Honeybee Genes Project Cornell University Spring 2017
Designed three web pages: a UserInput page (filename and path), a Processed page (success acknowledgement) for data processing and storage in an Oracle database, and a ShowUp page that displays the final analysis result table.
Extracted the GI number and nucleotide sequence of each gene, calculated the relative frequencies of nucleotides G and C and of the GC combination, and created an Oracle table to hold the processed data.
Connected Python to the Oracle database, conducted K-means cluster analysis in Jupyter, and plotted a 3D scatter plot.
Flight Delay Course Project Cornell University Spring 2017
Downloaded six online files and a dataset zip package into a newly created HDFS folder and decompressed the files.
Created a database via Hive through the CLI and used Pig in Ambari to build relations for loading all the files into the script.
Calculated average delays, longest delays, and the Pearson correlation coefficient by writing Python user-defined functions (UDFs).