Jackie Shu - Data Science

Location:

Arlington, VA

Salary:

Posted:

July 31, 2020

Resume:

Jackie Shu

786-***-****

************@*****.***

Data Scientist

* ***** **** ********, *** Data Science experience. Expert at handling large datasets and solving complex business problems to derive useful insights. Applied deep learning, artificial intelligence, and statistical methods to data science problems to increase understanding and improve profits and market share. Developed algorithms and implemented novel approaches to non-trivial business problems in a timely and efficient manner. Comfortable leading teams or working as a team member; full-stack machine learning engineer and data scientist. Years of experience presenting results and communicating result metrics to stakeholders.

TECHNICAL SKILLS

Programming

Python – 7 years

R – 5 years

SQL

MATLAB

Python Libraries

TensorFlow, PyTorch, NLTK, NumPy, pandas, OpenCV, Python Imaging Library (PIL), scikit-learn, SciPy, Matplotlib, Seaborn

IDE

Jupyter

Google Colab

PyCharm

RStudio

Version Control

GitHub

Bitbucket

Cloud Data Systems

AWS

GCP

Azure

Data Visualization

Matplotlib

Seaborn

Plotly

Folium

Computer Vision

Convolutional Neural Network (CNN)

HourGlass CNN

R-CNNs

YOLO

Generative Adversarial Network (GAN)

Natural Language Processing

Sentiment Analysis

Sentiment Classification

Sequence to Sequence Model

Transformer

BERT

Regression Models

Linear Regression

Logistic Regression

Gradient Boosting Regression

L1 (Lasso), L2 (Ridge)

Tree Algorithms

Decision Tree, Bagging

Random Forest

AdaBoost, Gradient Boost, XGBoost

Random Search and Grid Search

Analytical Methods

Exploratory Data Analysis

Statistical Analysis

Regression Analysis

Time Series Analysis

Survival Analysis

Sentiment Analysis

Principal Component Analysis

Decision Trees, Random Forest

WORK EXPERIENCE

Sr. Data Scientist – COVID19 Analysis

Nestle USA Arlington, VA (Remote)

June 2019 – Present

The goal of the most recent initiative was to derive insights from COVID-19 datasets. There were two types of data, images and patient records, and we built models for each separately. In the image dataset, each image was a CT scan of the lungs labeled either COVID or Non-COVID. The patient records were more complex and comprised five CSV files, each with different contents: patients' travel routes (the types of places they had visited, with latitude and longitude), the period between a patient being healthy and being isolated in the hospital, the growth of the infected population over time, etc.

●For Patients’ Records

Performed exploratory data analysis and extensive data visualization

Performed feature engineering, such as identifying which types of places people visited most frequently

Applied time series models (such as ARIMA and LSTM) to predict the increase in infections

Created a Tableau dashboard to show the expected increase or decrease in infection rates

Examined feature importance to identify which factors increase the likelihood of a COVID-19 diagnosis
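A minimal sketch of the kind of feature engineering described above, assuming a hypothetical list of (patient_id, place_type) visit records and a hypothetical cumulative case series. The names and sample values are illustrative only and are not taken from the project's CSV files.

```python
from collections import Counter

# Hypothetical visit records (patient_id, place_type); not the project data.
visits = [
    (1, "hospital"), (1, "restaurant"), (2, "hospital"),
    (3, "church"), (3, "hospital"), (4, "restaurant"),
]


def most_visited_place_types(records, top_n=3):
    """Count visits per place type and return the top_n most frequent."""
    counts = Counter(place for _, place in records)
    return counts.most_common(top_n)


def daily_increments(cumulative):
    """Turn a cumulative case series into day-over-day increments."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


top_places = most_visited_place_types(visits, top_n=2)

# Hypothetical cumulative infection counts for five consecutive days.
cumulative_cases = [10, 14, 21, 21, 30]
increments = daily_increments(cumulative_cases)
```

The increment series is the sort of target a time series model such as ARIMA or an LSTM would be fitted to; the model fitting itself is not shown here.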

●Image Dataset (contains 300 labeled images)

Used unsupervised learning (an autoencoder) and data augmentation to extract the essential features of both categories

Used the extracted features for supervised learning

Selected pre-trained models for transfer learning

Achieved 80% accuracy for binary classification

Created visualizations to show which areas of the images indicated a COVID or Non-COVID label

The project was completed but never reached the production stage.

Programming Languages Used: Python, R, SQL

Frameworks Used: TensorFlow, Keras, OpenCV, pandas, NumPy, Matplotlib, PIL, Folium, Tableau

Data Scientist / NLP Engineer

Home Depot Marietta, GA

April 2018 – April 2019

Home Depot strives to provide the best service to its customers and continues to look for ways to improve service quality. I worked at the company as an NLP engineer, building a multi-task machine learning model that extracts the most meaningful words in a sentence and classifies the sentiment of that sentence simultaneously. The dataset was labeled and contained 300k reviews. Each review was labeled with its most meaningful words and its category.

●Studied the service reviews and their significance to the business.

●Split the problem into two parts (divide and conquer): sentiment extraction and sentiment classification.

●Cleaned the data by removing symbols, emotes, links, etc.

●Used TensorFlow to build two models: a sentiment classifier and a sequence-to-sequence model. Implemented positional encoding, self-attention, and attention mechanisms in both models, then combined the two into a single model. The metrics used were recall, precision, F1 score, and Jaccard score.

●Used the LIME explainer library to demonstrate feature importance and explain which words and phrases indicated a negative review.

●Used the model to provide insight into which services needed improvement, helping guide upper-level management in the right direction.

●Because negative reviews were more important, we paid closer attention to false positives. The final recall and precision for negative reviews were 91% and 85%, respectively.

●Deployed the API on Microsoft Azure.
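A minimal sketch of the cleaning step and the evaluation metrics listed above, assuming simple whitespace tokenization. The function names, regular expressions, and sample inputs are hypothetical illustrations, not the code used in the project.

```python
import re

# Assumed patterns: URLs, then anything that is not a lowercase letter,
# digit, whitespace, or apostrophe (this also drops emotes and emoji).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s']")


def clean_review(text):
    """Lowercase a review and strip links, symbols, and emotes."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split())


def jaccard(predicted, reference):
    """Word-level Jaccard score between an extracted span and its label."""
    a, b = set(predicted.split()), set(reference.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def precision_recall(y_true, y_pred, positive="negative"):
    """Precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


cleaned = clean_review("Terrible service!!! :( see http://example.com")
score = jaccard("terrible service", cleaned)
```

Treating the negative class as the class of interest means precision and recall are reported for negative reviews, which matches how the results above are stated.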

Programming Language Used: Python

Frameworks Used: TensorFlow, pandas, NumPy, Matplotlib, Seaborn, LIME

Data Scientist

Procter & Gamble Cincinnati, OH

September 2016 – March 2018

Procter & Gamble (P&G) is a leading provider of high-quality everyday products. I worked at P&G as a data scientist. During that period, I analyzed diaper images and built models to predict whether babies were comfortable in their diapers. Labeled images were very limited (around 800), but around 30,000 unlabeled images were available. This was a part-time position requiring at least 20 hours per week, since I was still working as a deep learning researcher at the University of Cincinnati.

●Studied the diaper images, for example dry versus wet diapers, and diapers that were uncomfortable (duck tail, bulky) versus comfortable (nicely fitted) for babies.

●Used P&G-developed algorithms to segment and extract only the diapers for classification tasks.

●Manually segmented diapers from the babies' bodies in Photoshop to obtain higher-quality segmentations and improve classification performance.

●Used pre-trained models to visualize the feature maps in the intermediate layers and performed transfer learning.

●Applied unsupervised learning to the 30,000 unlabeled images using an autoencoder.

●Extracted the latent space features and used them for supervised learning with the 800 labeled images.

●Presented results to nontechnical audiences and demonstrated superior performance of transfer learning.

Programming Language Used: Python

Frameworks Used: TensorFlow, Keras, OpenCV, SciPy, pandas, NumPy, Matplotlib, PIL

Data Scientist / Deep Learning Researcher

University of Cincinnati Cincinnati, OH

September 2013 – December 2016

I worked at the University of Cincinnati as a deep learning researcher for 3 years. During that period, the university's curriculum lacked deep learning content, especially for computer vision. At my advisor's request, I built machine learning programs to support his courses and frequently presented my projects in his computer vision course. Meanwhile, my labmates and I participated in the China Physiological Competition for electrocardiogram classification.

●Used TensorFlow and Keras to build neural networks for image classification, image segmentation, object detection, and image captioning.

●Used pre-trained models (VGG16, ResNet, Inception, DenseNet, U-Net, etc.) for transfer learning on small datasets.

●Assisted the professor in teaching computer vision courses.

●Presented deep learning concepts and explained projects/demos to encourage and motivate students.

●Helped students solve programming issues and answered questions during office hours and in class.

●Taught myself the most recent deep learning technologies.

●Built an electrocardiogram classifier using TensorFlow.

Programming Languages Used: MATLAB, Python

Frameworks Used: TensorFlow, Keras, OpenCV, SciPy, pandas, NumPy, Matplotlib, PIL

Education

Master of Science - Electrical Engineering

University of Cincinnati

Bachelor of Science - Electrical Engineering

University of Cincinnati
