Annie Yang
*** ********* ****** **, *******, GA *0339 ac8drb@r.postjobfree.com 404-***-****
SKILLS
Programming: R (dplyr, glmnet, ggplot2), Python (scikit-learn, pandas, matplotlib), SAS (certified), SQL
Statistics: Probability, Statistical Inference, Hypothesis Testing, Bayesian Inference, ANOVA
Machine Learning: Supervised (linear regression, recommendation, bagging, random forest, boosting), Unsupervised (clustering, PCA), NLP
Languages: English (fluent), Chinese (native), Japanese (intermediate)
EDUCATION
Rollins School of Public Health, Emory University Atlanta, GA
Master of Science (MS), Biostatistics May 2019
• GPA: 3.84/4.0
• Relevant Coursework: Regression, R Programming, Machine Learning with Python, Survival Analysis, Statistical Inference
Emory University, College of Arts and Sciences Atlanta, GA
Bachelor of Arts (BA), Biology major, Mathematics minor May 2017
• GPA: 3.80/4.0; Dean’s List (2014 & 2016)
• Relevant Coursework: Linear Algebra, Bioinformatics, Java Programming, Cancer Genetics
WORK EXPERIENCES
MediSix Therapeutics Singapore, Singapore
Data Analyst Intern May 2018 – September 2018
• Visualized and applied unsupervised machine learning algorithms on 2.5GB patient data to frame research plans for CAR-T drug development
• Established correlation analysis and discussed results with the Chief Scientific Officer to generate target feature sets and outcome metrics
• Summarized data patterns among leukemia samples by coding PCA and clustering algorithms via R and ggplot2 heatmap visualization
• Proposed target users for the new medicine based on derived conclusions, and assisted MediSix with drafting hypotheses for laboratory tests
Winship Cancer Institute Atlanta, GA
Clinical Research Assistant September 2018 – Present
• Produced descriptive statistics via SAS Macro, and examined variations between treatment and control groups via t tests and Chi-square tests
• Developed logistic regression models to investigate the univariate and adjusted effects of each variable on binary treatment outcome
• Compared progression-free survival and overall survival between treatment and control groups via log-rank tests and Kaplan-Meier curves
• Built Cox proportional-hazards models to assess the adjusted effects of treatment on progression-free survival and overall survival
• Evaluated and chose the best model through regression diagnostics, AIC, and forward model selection
Rollins School of Public Health Atlanta, GA
Data Analyst August 2014 – May 2016
• Transformed and analyzed survey results in SQL and R to explore effects of physical activity and urbanization on children’s health
• Co-authored the papers ‘Healthfulness, Modernity, and Availability of Food and Beverages: Adolescents’ Perceptions in Southern India’ and
‘The Influence of Pediatric Oncology Summer Camp Attendance on Physical Activity, Fatigue, and Oxidative Stress’
• Found that urbanization is associated with the development of a sedentary lifestyle among young adults via t-test (p-value < 0.01)
• Concluded that physical activity has a stronger influence on fatigue in higher BMI categories via Chi-square and Fisher’s exact tests (p-values < 0.001)
PROJECTS
Lending Club Risk Analysis Atlanta, GA
R Programming March 2018 – April 2018
• Fitted linear and logistic regression models on 3GB loan data to simulate credit risk via interest rate (continuous) and loan status (categorical)
• Performed data preprocessing, missing-value imputation, and feature engineering for mixed-type data, including numerical, categorical, and time-series variables
• Assessed model assumptions via residual diagnostics, reduced multicollinearity through regularization, and improved AUC from 0.64 to 0.81
Yelp Restaurant Recommender Singapore, Singapore
Python (Jupyter) June 2018 – July 2018
• Transformed 6GB unstructured review data to feature vectors by applying NLP methods, such as TF-IDF vectorization
• Implemented K-Means clustering algorithms on reviews, and investigated cluster centroids to understand user preferences
• Determined top attributes of positive and negative reviews via Logistic Regression and Random Forest; reduced overfitting via PCA
• Constructed collaborative filtering recommendation system based on predictive models to customize restaurant suggestions
Breast Cancer Prediction Chicago, IL
Python (Jupyter) December 2017 – January 2018
• Utilized dimensionality reduction and classification to predict disease status, and to identify top features characterizing breast cancer
• Leveraged PCA to address multicollinearity assessed through correlation matrix, and to associate sample subgroups with clinical outcomes
• Established machine learning models with Logistic Regression, KNN, Random Forest, and Gradient Boosting to predict clinical outcomes
• Assessed model performance by calculating AUC, ROC curve, accuracy, precision, and recall; tuned each model by grid search with cross-validation; improved prediction accuracy from 86% to 97% on test data through feature selection, model comparison, and parameter tuning