YAN CHENG
*** **********, ******, ** ***** ************@*****.*** 803-***-****
EDUCATION: Master of Statistics, Texas A&M University, Jan 2021 - Aug 2022, GPA 3.82
CERTIFICATES: SAS Certified Base Programmer
PROJECTS (see https://github.com/Yan8866 for more details)
Lung Cancer Data Analysis (R): Compared four machine learning models (Logistic Regression, Decision Tree, Pruned Tree and Random Forest) and selected Logistic Regression for its combination of interpretability and predictive power; used it to identify significant predictors of lung cancer and to predict how likely a person with given features is to develop lung cancer.
Volatility Analysis of the Returns of S&P 500 (R): Built ARCH(1) and GARCH(1,1) models to analyze and predict the volatility of S&P 500 returns.
Credit Card Fraud Detection (Python): Trained Linear Regression, Decision Tree, Random Forest and Support Vector Machine models with scikit-learn, used them to predict how likely a transaction is to be fraudulent, and compared their prediction error rates. Random Forest had the greatest predictive power.
Insurance Fraud Detection (Python): Used pandas, NumPy, seaborn, Matplotlib, Plotly, missingno, scikit-learn and XGBoost to analyze an insurance data set with 1,000 entries and 40 variables; trained nine classifiers, including SVM, KNN, Decision Tree and Random Forest. Decision Tree had the best test accuracy, at 0.816.
TECHNICAL SKILLS
Using version control system Git and its host GitHub
Familiar with big data tools like Hadoop and Spark
Using Docker to share work and reproduce project results
Using Tableau to build dashboards, visualize and analyze data
Supervised & Unsupervised Learning: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, K-Nearest Neighbors, Bagging, Random Forest, Boosting, Support Vector Machines, Principal Component Analysis, Clustering Analysis
Experiment design and sampling methodologies
Building predictive models using frequentist and Bayesian methods
Parametric, semiparametric and non-parametric techniques
Spatial data, time series data and categorical data visualization and analysis
OTHER SKILLS
Excellent written and spoken communication skills
Stakeholder management: planning, negotiation, problem solving, conflict resolution
Organization and time management
PROGRAMMING LANGUAGES: R, Python, SQL, SAS
WORK EXPERIENCE: Classroom Teacher
Sep 2016 - Jun 2017: Marlborough High School (MA, USA)
Aug 2013 - Jun 2015: White Knoll Middle School (SC, USA)
Aug 2011 - Jun 2013: East Point Academy (SC, USA)