Data Mining, Predictive Modeling, Database, Operation Analysis

Location:

College Park, MD

Posted:

April 30, 2019


Resume:

Yichi (Sylvia) Zhang

Address: Greenbelt, MD ***** Phone:812-***-**** *****.*****@*******.***.***

EDUCATION

University of Maryland, R. H. Smith School of Business, College Park, MD Dec 2018

Master of Science in Business Analytics; GPA: 3.56/4.00

Selected Courses: Database Management (A); Data Mining and Predictive Analytics (A); Big Data and AI (A); Data Processing in Python

Indiana University, College of Arts & Sciences, Bloomington, IN May 2016

Bachelor of Science in Apparel Merchandising (Fashion Marketing)

Honors: Leah Weidman Scholarship, Phi Sigma Theta National Honor Society, Executive Dean’s List

Minors: Marketing, German

TECHNICAL SKILLS

Languages: Python, R, SQL, SAS, Unix, JavaScript, HTML5

Big Data: Spark, Pig, Hive, Hadoop, MapReduce, AWS (S3, EC2, EMR, Redshift, Aurora, VPC)

Statistics and Modeling: Linear Regression, Logistic Regression, Random Forest, PCA, Clustering, XGBoost, Time Series, Hypothesis Testing, A/B Testing, Natural Language Processing, Latent Semantic Analysis, Image Processing

Tools and others: Agile methodology, Scrum, CRISP-DM, Alteryx, Tableau, Power BI, Excel (Advanced), Google Analytics

EXPERIENCE

Strategy & Performance Analysis Intern II, AARP, Washington, DC Jan 2018–Dec 2018

Led the membership analysis team's prediction projects using machine learning models; performed data ETL and built visualization reports for the Social Engagement Community (SEC), Advocacy, and Multicultural teams, archiving versions on SharePoint

Collected and transformed raw state branch data (230K rows) into the national office’s database and preprocessed it with the Python NumPy and pandas packages, including accuracy verification, missing-value imputation, and text format standardization

Extracted customer email information (15,000,000 rows) from Redshift for the analytics team using SAS queries

Predicted the membership acquisition response rate of the direct mail channel for 3Q18 through FY19 using a time series model, enabling the analyst team to plan the marketing acquisition budget

Predicted event return rates with Linear Discriminant Regression and Random Forest using Python scikit-learn on historical data for 100,000 events (30 features), achieving 86.4% accuracy, and drafted event promotion strategy reports for the marketing team

Raised the relevancy score of results returned by the company's internal search by 62% by creating step-by-step tagging guidance and keyword search lists

Data Scientist Associate, CITIC Co. Ltd, San Jose, CA Sep 2016–Aug 2017

Trained 1,000 local store managers and associates across 220 stores in China in data logging to improve data quality

Cleaned over 600,000 raw sales and transaction records collected from offline and online channels using SQL queries

Implemented a “customer shopping footprint” dashboard based on the profiles and purchase behavior of over 450,000 customers, using Navicat Premium (SQL database), Tableau, and HTML5, as a prototype for the engineering team; iterated on the functions of the initial Java-based platform based on user feedback

Drafted reports analyzing typical personas of offline customers and presented them to stakeholders

Operation Analyst Intern, ELLASSAY Fashion Co. Ltd, Shenzhen, China Jul 2014–Aug 2014

Collected sales performance data of 239,000 SKUs including current sales, stock turnover, and customer satisfaction

Optimized the inventory distribution plan based on the data above to maximize total revenue

Summarized fashion trends for 2015 F/W by collecting statistics on clothing materials, colors, patterns, silhouettes, and themes, and reported the findings to the director of the design department

SELECTED PROJECTS

Text Analytics with Principal Financial: built an AI model to detect whether texts scraped from third-party sources are about government policy, trade restrictions, lawsuits, intellectual property, or trademarks, in order to provide reports for financial analysts:

Labeled parsed text over four weeks to set up training data, yielding over 2,000 useful texts

Created a data pipeline and experimented with traditional ML methods on numerical variables (Logistic Regression, Naive Bayes, SVM, Random Forest, Gradient Boosting); Random Forest returned the highest accuracy at 82%

Applied TF-IDF to the text data to create bag-of-words features, and fed the resulting matrix into a new logistic regression model

Applied an ensemble method that combined weighted outputs from the Random Forest and the TF-IDF-based Logistic Regression to build a stronger model, achieving the highest accuracy at 88%

Increased the highest accuracy from 76.02%, using only the features given by the client, to 88% after feature engineering and ensembling

Recommended that the client use the model with the newly created features, drop unimportant features, and create more diverse tagging
