Associate Data Analyst Intern

Location:

St. Louis, MO

Posted:

April 10, 2023


Resume:

Education

Washington University in St. Louis St. Louis, MO

Master's Degree in Engineering Data Analytics and Statistics 05/2022 Overall GPA: 3.91

Invited to serve as teaching assistant for Detection and Estimation Theory and for Statistical Methods for Data Analysis with Applications to Financial Engineering

Purdue University West Lafayette, Indiana

Bachelor's Degree in Statistics 05/2020

Minored in Psychology

Graduated summa cum laude, 3.54 GPA

Dean's List in Fall 2019

Semester Honors in Spring 2018, Fall 2018, Fall 2019

Work Experience

MilliporeSigma by Randstad St. Louis, MO

Data Scientist and Report Analyst 07/2022 - Present

Manipulates and analyzes complex, high-volume, high-dimensional data using RStudio, Jupyter, and Excel

Uses machine learning algorithms, libraries, and tools to build machine learning and statistical models (Cubist, XGBoost, random forest, logistic regression, etc.)

Applies statistical learning models to translate business problems into applied data science solutions

Applies data analysis, data mining, and data processing using SIMCA, R, and Python to present data clearly

Proposes improvements for data connectivity and liaises with Data Engineers

Helps to build overall awareness of analytical capabilities within the company

Project 1: Developing and testing a machine learning-based multivariate data analysis (MVDA) tool to generate the best formula for protein productivity.

Use R and Python to perform data exploration: identify relationships and trends in the data, as well as any factors that could affect the research results, using statistical methods such as Pearson correlation and ANOVA tables in R.

Plot heatmaps and boxplots to visualize data correlation and identify outliers.

Use Python packages including Pandas, NumPy, and scikit-learn to design, build, and improve machine learning models including logistic regression, Bayesian methods, and random forest.

Use Python for performance evaluation and deployment of risk score calculation, including ROC curves and confusion matrices.

Work end to end, from data collection, feature generation and selection, and algorithm development through forecasting, visualization, and summarizing model results.

Collaborate with bioscientists on the design, documentation, testing, and implementation of research studies.

Use Palantir and Tableau to create a curated data ontology and data visualizations.

Use ELN for final reporting.

Present the whole project at meetings as the representative of my group.

Project 2: Develop and manage a database for the stockroom.

Use Pandas in Python to manage and clean the dataset.

Use MySQL to define the tables, columns, and relationships between data elements, then populate the database schema by importing data from spreadsheets.

Write queries, including window functions, to perform the calculations stakeholders need.

Monitor database performance, optimize queries, and perform regular backups to prevent data loss.

Beijing Cavo Public Relations Consultants Beijing, Beijing

Associate Data Analyst Intern 12/2019 - 06/2020

Data collection and cleaning: I collect and organize data from various sources, ensuring that data is accurate, complete, and of high quality.

JINGYA WANG

765-***-**** **************@*****.*** 8300 Delmar Boulevard, Apartment 112, St. Louis, MO

Data analysis: I conduct exploratory data analysis to identify patterns, trends, and insights that can inform business decisions. This includes using statistical tools such as regression analysis or time series analysis to develop insights. For example, I use RStudio to run a regression analysis of the relationship between company revenue and industry trends, helping our group better understand our data and make better business decisions.

Reporting: I create reports and dashboards to communicate findings to stakeholders, using data visualization tools. For example, I increased the accessibility and usability of customer data by redesigning data visualizations to include statistical graphs and informational graphics using R, Tableau, and Excel.

Statistical modeling: I build statistical models using tools such as R or Python to help predict future trends or outcomes. This includes building models such as decision trees, random forests, and neural networks.

Collaboration: I work closely with team members and stakeholders to ensure that our data analysis aligns with business objectives and provides value to the organization. I join weekly meetings to collaborate with other teams on solutions and give presentations explaining my work to colleagues who do not have a statistical background.

Projects

Data Visualization for Business Insights Project Tableau

Used Tableau and Tableau Prep for exploratory data analysis of the Suicide Rates dataset.

Created an exploratory dashboard including a world map, 100% bar chart, population chart, and basic table to analyze the relationship between People and the suicide rate.

Created an exploratory dashboard including a quadrant chart and a time series with forecasting to analyze the relationship between the Economy, HDI, Suicide forecast, and the suicide rate.

Created story points and generated insights and recommendations.

Data Mining Project Python, data mining

Use Python for data preprocessing and to implement k-nearest-neighbors, decision tree, random forest, two SVM methods (using a polynomial kernel and a Gaussian kernel), and two (deep) neural networks with sigmoid and ReLU activation functions for model selection on the Breast Cancer Wisconsin (Original) Data Set.

Achieved 97% accuracy in predicting breast cancer.

Practicum in Data Analytics & Statistics Python, machine learning, classification

Use Jupyter Notebook to implement the k-nearest-neighbors distance method for outlier detection, one-hot encoding for categorical data, and LabelEncoder for binary data on the Bank Marketing Data Set of a Portuguese banking institution.

Use the Synthetic Minority Over-sampling Technique (SMOTE) to rebalance the imbalanced binary dataset from 964:3103 to 3103:3103.

Implement logistic regression, random forest, and gradient boosting for model selection.

Tune hyperparameters using cross-validation.

Analyze model performance using confusion matrix, ROC, and AUC.

Achieved 88% accuracy in predicting whether a client will subscribe to a term deposit.

Organize meetings with group members, leading to the adoption of the imbalanced-dataset improvement and the model selection approach.

Machine Learning and Pattern Classification Jupyter Notebook

Design, implement, and test classification algorithms to achieve the best classification performance on the red wine quality data set.

Use principal component analysis (PCA) to reduce the dimensionality of the dataset and increase the interpretability of the data.

Use boxplots to detect outliers and LabelEncoder to encode the categorical target (y).

Implement the Support Vector Machine (SVM) and Random Forest (RF) methods on the dataset.

Use a train-test split and grid search with cross-validation (GridSearchCV) to evaluate performance and tune the hyperparameters of SVM and RF.

Use a confusion matrix to measure model performance.

As a result, Random Forest achieves 82% accuracy, which is higher than the SVM method.

Applied Linear Regression RStudio, linear regression

Generated a predictive linear model in RStudio to analyze the relationship between sound pressure and numerous variables.

Produced Q-Q plot visualizations; ran the Brown-Forsythe and Shapiro-Wilk tests, examined p-values, and applied remedial measures; employed stepwise selection and ANOVA for model evaluation. Interpreted the results through written reports and presentations.

Skills

Microsoft Excel, Microsoft Word, Python, R, SQL, Tableau, A/B testing, Machine learning, Tableau Prep, Palantir, ELN, SIMCA, LaTeX, User Acceptance Testing (UAT), Jupyter notebooks, Pandas, NumPy, scikit-learn, Logistic Regression, Random Forest, Neural Networks, Bayesian, ANOVA, Pearson Correlation
