Data Analyst

Location:

Paramus, NJ

Posted:

February 06, 2019


Resume:

Luyuan Zhang

Maywood, NJ ***** (M) 609-***-****

***************@*****.***

Professional Summary

I am proficient in data analysis and data mining.

Skills

● Python ● SQL ● Tableau ● Machine Learning

● ETL data pipeline ● Analysis ● Visualization ● A/B testing

Education

Data Analyst Nanodegree, Aug 2008

Udacity

Ph.D., Chemical Physics, Jul 2010

The Ohio State University, Columbus, OH, U.S.A.

B.S., Chemical Physics, Jul 2003

University of Science and Technology of China, Hefei, Anhui, China

Projects: (github.com/zhangluyuan/)

Data Analysis:

Investigate Kaggle European soccer data

-exploratory data analysis with Python

-extracted data using SQL queries; explored data using Pandas and Numpy; plotted visualization with matplotlib

-audited, cleaned, calculated new variables, and merged separate tables for a data set of 300+ MB, with 194K+ records and 115 features

-asked 6 questions, and ran hypothesis tests and calculated correlations to answer them

-most striking finding: the win rate is 60% for home teams and 40% for away teams, so home teams are 50% more likely to win than away teams

Explore red wine quality

-exploratory data analysis with R

-red wine quality data set of 12 variables and 1599 observations

-explored dataset for distributions, outliers, and anomalies

-quantified and visualized data using scatter plots, histograms, bar charts, and boxplots

-identified variables of interest (pH, density and wine quality)

-examined relationship between multiple variables by visualization and correlations

-built predictive models, calculated correlations, and investigated conditional means

Tableau story of baseball player performance

-interactive data exploration and visualization with Tableau

-dataset contains height, weight, handedness, batting average, and home runs of 1157 players

-created a new variable BMI to account for combined effects of height and weight on player performance

-chose optimal visual elements to encode data

-critically assessed the effectiveness of a visualization

-delivered key findings with effective visualizations: smaller players have higher batting averages, and bigger players hit more home runs

A/B testing of webpage experiment

-cleaned data for missing values and mismatching rows

-set up null and alternative hypotheses for hypothesis testing

-z-test with statsmodels’ proportions_ztest

-test with a logistic regression model

-all 3 methods indicate that the new webpage is not better than the old one

Machine Learning:

Supervised Learning - Finding Donors for CharityML

-investigated correlation between label and features through exploratory analysis

-processed data with log-transformation, normalization and one-hot encoding

-split data into training (80%) and test (20%) sets

-chose three models and evaluated their performance in terms of running time and accuracy on training/test data

-chose the best-performing model and fine-tuned it with GridSearchCV

-evaluated feature importances and their effect on model performance

Deep Learning - Developing an AI application for Flower Image Classification

-loaded and transformed data using torchvision’s datasets and transforms modules

-created data loaders with torch.utils.data.DataLoader

-built and trained a model from a pre-trained network (‘vgg19’)

-redefined the final part of the network with a new hidden layer (4096 units) and a new output layer for 102 classes

-saved the checkpoint to a local drive

-re-loaded the checkpoint and re-built the model for prediction

-wrote train.py and predict.py scripts that perform all of the above functions with more flexibility

Unsupervised Learning - Identify Customer Segments

-preprocessed training data (900,000 records and 85 columns) to deal with missing values and re-encode categorical features

-wrote a cleaning function that performs the preprocessing

-applied standard scaling and imputation with median

-performed dimensionality reduction with PCA, and chose the optimal number of principal components based on the cumulative variance curve

-interpreted principal components based on feature weights of each PC

-applied KMeans clustering to a subset of the training data, chose the optimal K based on scores, and then re-applied KMeans to the complete training set

-preprocessed client data with the cleaning function

-applied the trained KMeans model to client data, and compared the results for the training data and the client data

-interpreted results based on cluster centers and made suggestions to the client on which types of customers to focus on

Experience

Postdoctoral Research Scientist, Columbia University, New York, NY 05/2012 to 01/2018

Postdoctoral Research Scientist, Princeton University, Princeton, NJ 09/2010 to 12/2011

Graduate Research Assistant, The Ohio State University, Columbus, OH 09/2003 to 07/2010

Designed and implemented multiple optical spectroscopic methods for study of biological events

Developed independence and persistence in research, and a habit of critical thinking

Developed a goal-oriented attitude to meet project deadlines

Acquired teamwork skills

Developed a strong instinct for data, including what data is needed, how to acquire it, what to present, and how to present it

Developed strong presentation skills. Prepared 100+ figures for publication in prestigious journals

Developed strong writing skills. Wrote and published 10+ scientific research papers
