Jackie Shu
************@*****.***
Data Scientist
* ***** **** ********, *** Data Science experience. Expert at handling large datasets and solving complex business problems to derive useful insights. Applied deep learning, artificial intelligence, and statistical methods to data science problems to increase understanding and grow profits and market share. Developed algorithms and implemented novel approaches to non-trivial business problems in a timely and efficient manner. Full-stack machine learning engineer and data scientist, comfortable leading teams or working as a team member. Years of experience presenting results and communicating metrics to stakeholders.
TECHNICAL SKILLS
Programming
Python – 7 years
R – 5 years
SQL
MATLAB
Python Libraries
TensorFlow, PyTorch, NLTK, NumPy, pandas, OpenCV, Python Imaging Library (PIL), scikit-learn, SciPy, Matplotlib, Seaborn
IDE
Jupyter
Google Colab
PyCharm
RStudio
Version Control
GitHub
Bitbucket
Cloud Data Systems
AWS
GCP
Azure
Data Visualization
Matplotlib
Seaborn
Plotly
Folium
Computer Vision
Convolutional Neural Network (CNN)
Hourglass CNN
R-CNNs
YOLO
Generative Adversarial Network (GAN)
Natural Language Processing
Sentiment Analysis
Sentiment Classification
Sequence to Sequence Model
Transformer
BERT
Regression Models
Linear Regression
Logistic Regression
Gradient Boosting Regression
L1 (Lasso), L2 (Ridge)
Tree Algorithms
Decision Tree, Bagging
Random Forest
AdaBoost, Gradient Boost, XGBoost
Random Search and Grid Search
Analytical Methods
Exploratory Data Analysis
Statistical Analysis
Regression Analysis
Time Series Analysis
Survival Analysis
Sentiment Analysis
Principal Component Analysis
Decision Trees, Random Forest
WORK EXPERIENCE
Sr. Data Scientist – COVID-19 Analysis
Nestle USA Arlington, VA (Remote)
June 2019 – Present
The most recent initiative aimed to derive insights from COVID-19 datasets of two types: images and patient records. We built separate models for each. In the image dataset, each image was a CT scan of the lungs labeled either COVID or Non-COVID. The patient-record dataset was more complex, comprising 5 CSV files with different contents, including patients’ travel routes (the kinds of places they had visited, with latitude and longitude), the time between when patients were healthy and when they were isolated in the hospital, and the growth of the infected population over time.
●Patient Records Dataset
Performed exploratory data analysis and extensive data visualization
Engineered features, such as identifying the kinds of places people visited most frequently
Applied time series models (such as ARIMA and LSTM) to predict increases in infections
Created a Tableau dashboard to show the expected increase or decrease in infection rates
Examined feature importance to identify which factors increase the likelihood of a COVID-19 diagnosis
●Image Dataset (contains 300 labeled images)
Used unsupervised learning (autoencoder) and data augmentation to extract the essential features from both categories
Used the extracted features for supervised learning
Selected pre-trained models for transfer learning
Achieved 80% accuracy for binary classification
Created visualizations showing which areas of the images indicated a COVID or Non-COVID diagnosis
The project was completed, but never reached the production stage.
Programming Languages Used: Python, R, SQL
Frameworks and Tools Used: TensorFlow, Keras, OpenCV, pandas, NumPy, Matplotlib, PIL, Folium, Tableau
Data Scientist / NLP Engineer
Home Depot Marietta, GA
April 2018 – April 2019
Home Depot aims to keep improving the quality of its customer service. As an NLP engineer, I built a multi-task machine learning model that extracts the most meaningful words in a sentence and classifies the sentiment of that sentence simultaneously. The dataset contained 300k labeled reviews, each annotated with its most meaningful words and its category.
●Studied the significance of the service reviews to the business.
●Split the problem into two parts (divide and conquer): sentiment extraction and sentiment classification.
●Cleaned the data by removing symbols, emotes, links, etc.
●Used TensorFlow to build two models: a sentiment classifier and a sequence-to-sequence model. Implemented positional encoding, self-attention, and attention mechanisms in both models, then combined the two into a single model. Evaluated with recall, precision, and Jaccard score (closely related to the F1 score).
●Used the LIME model-explanation library to show feature importance and explain which words and phrases indicated a negative review.
●Used the model to identify which services needed improvement, helping guide upper-level management.
●Because negative reviews mattered most, we paid particular attention to false positives. The final recall and precision for negative reviews were 91% and 85%, respectively.
●Deployed the model as an API on Microsoft Azure.
Programming Language Used: Python
Frameworks Used: TensorFlow, pandas, NumPy, Matplotlib, Seaborn, LIME
Data Scientist
Procter & Gamble Cincinnati, OH
September 2016 – March 2018
Procter & Gamble (P&G) is a leading maker of high-quality everyday products. As a data scientist at P&G, I analyzed diaper images and built models to predict whether babies were comfortable in their diapers. Labeled images were very limited (around 800), but around 30,000 unlabeled images were available. This was a part-time role requiring at least 20 hours per week, since I was also working as a deep learning researcher at the University of Cincinnati.
●Studied the diaper images to distinguish, for example, dry diapers from wet diapers, and diapers that were uncomfortable (duck tail, bulky) from those that were comfortable (well fitted) for babies.
●Used P&G-developed algorithms to segment and extract only the diapers for classification tasks.
●Manually segmented diapers from the babies’ bodies in Photoshop to obtain higher-quality segmentations and improve classification performance.
●Used pre-trained models to visualize the feature maps in the intermediate layers and performed transfer learning.
●Applied unsupervised learning on 30,000 unlabeled images using an autoencoder.
●Extracted the latent-space features and used them for supervised learning on the 800 labeled images.
●Presented results to nontechnical audiences and demonstrated superior performance of transfer learning.
Programming Language Used: Python
Frameworks Used: TensorFlow, Keras, OpenCV, SciPy, pandas, NumPy, Matplotlib, PIL
Data Scientist / Deep Learning Researcher
University of Cincinnati Cincinnati, OH
September 2013 – December 2016
I worked at the University of Cincinnati as a deep learning researcher for 3 years. At the time, the university’s curriculum lacked deep learning content, especially for computer vision. At my advisor’s request, I built machine learning programs to support his courses and frequently presented my projects in his computer vision course. My labmates and I also participated in the China Physiological Competition for electrocardiogram classification.
●Used TensorFlow and Keras to build neural networks for image classification, image segmentation, object detection, and image captioning.
●Used pre-trained models (VGG16, ResNets, Inceptions, DenseNet, U-Net, etc.) for transfer learning on small datasets.
●Assisted the professor in teaching computer vision courses.
●Presented deep learning concepts and explained projects/demos to encourage and motivate students.
●Helped students solve programming issues and answered questions during office hours and in class.
●Taught myself the most recent deep learning technologies.
●Built an electrocardiogram classifier using TensorFlow.
Programming Languages Used: MATLAB, Python
Frameworks Used: TensorFlow, Keras, OpenCV, SciPy, pandas, NumPy, Matplotlib, PIL
EDUCATION
Master of Science - Electrical Engineering
University of Cincinnati
Bachelor of Science - Electrical Engineering
University of Cincinnati