Information Science Master's Student at Cornell University

Location:

Ithaca, NY

Posted:

February 21, 2021

Resume:

Joseph Kuo

330-***-****

******@*******.***

EDUCATION

Cornell University, Department of Computing and Information Science Ithaca, NY

Master of Professional Studies (MPS) – Information Science Expected Dec 2021

• GPA: 3.72

Cornell University, Department of Computing and Information Science Ithaca, NY

Bachelor of Arts, Information Science May 2020

• GPA: 3.66

DATA SCIENCE COMPETENCIES

Languages & Tools: Python, R, SQL, Jupyter, pandas, dplyr, scikit-learn, PySpark, Tableau, Java, HTML, CSS, JavaScript, Excel

Techniques: Regression, Classification, Decision Trees, Clustering, Neural Nets, Sentiment Analysis, Data Mining, Data Visualization, Data Analysis, Dimensionality Reduction, Boosting, Optimization, Dashboarding, Statistics, A/B Testing, Experimental Design, Qualitative Analysis, Data Structures, Object-Oriented Programming

RELEVANT EXPERIENCE AND PROJECTS

Institute for Information Industry Taipei, Taiwan

AI and Machine Learning Intern Summer 2018

Built an executable pipeline that detects thyroid cancer cells in a microscopic thyroid biopsy image, reducing a patient's thyroid cancer diagnosis time from 30 minutes to 5 minutes

• Developed an area-average down-sampling algorithm to transform a super high-resolution image (460k x 230k) into 24 sub-images (6973 x 2418 each)

• Improved image preprocessing by using K-Means clustering and K-Nearest Neighbors on the pixel intensity distribution to propose regions that potentially contain thyroid cancer cells, reducing computational time by 60%

• Trained a Convolutional Neural Network using the Single Shot MultiBox Detector (SSD) algorithm that detects thyroid cancer cells with 92.5% accuracy and 97.8% recall

Stock Market Sentiment Analysis

ML Co-Lead Fall 2020

Performed large-scale social media data analysis and applied Natural Language Processing (NLP) methods to identify patterns and create sentiment metrics that contextualize market movement and enrich Cheddar News’ daily market reporting

• Built a Reddit scraper using Python to collect over 2 million posts from Reddit through the Pushshift API

• Cleaned and vectorized text into trainable features using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm

• Trained a Naïve Bayes classifier (84% accuracy and 93% recall) and a Logistic Regression classifier (86% accuracy and 91% precision) to predict whether a post is “bullish” or “bearish” on the stock market

• Performed sentiment analysis and established a set of sentiment metrics/indicators to track the stock market’s daily sentiment on Reddit and detect anomalies

• Visualized the relationship between stock price movement and sentiment trends on social media

• Presented to the CTO of Cheddar Inc. on the applicability and limitations of our system for their daily news reporting

Skin Care/Beauty Product Recommendation System

Co-Lead Spring 2019

Collaborated with a frontend team to develop a personalized skin care recommendation system that lets users search for products while targeting multiple desired effects and satisfying various constraints

• Built a pipeline that uses Boolean Search to filter out data unrelated to a given input

• Vectorized relevant product descriptions and user reviews into an input feature matrix using TF-IDF

• Implemented the Singular Value Decomposition (SVD) algorithm to reduce the dimensionality of the input matrix, speeding up calculation of Cosine Similarity scores by 50%

Analysis of Data Science Topic Popularity on Stack Exchange Summer 2020

Performed quantitative analysis on a Stack Exchange dataset to identify the most popular Data Science topics and recommended the type of content that a Data Science education company should produce

• Used SQL to query 10k posts from 2019 in the Stack Exchange online database

• Cleaned and analyzed post-tag data with pandas and visualized trends of the most popular Data Science topics with seaborn

• Identified and proposed the 3 top-trending Data Science content topics based on users’ interests on Stack Exchange

Predictive Modeling for Titanic Survival Rate Summer 2020

Outlined a workflow for building and optimizing a Logistic Regression model, a KNN model, and a Random Forest model using pandas and scikit-learn based on the Titanic dataset from Kaggle

• Cleaned the dataset through standardization, data imputation, and removal of irrelevant or duplicated data using pandas

• Visualized data distributions and engineered features through quantile binning and one-hot encoding based on passenger status and historical context; selected the best features using RFECV

• Tuned hyperparameters and optimized the Random Forest model using grid search and cross validation to achieve 82% accuracy

ADDITIONAL EXPERIENCE

• Organized and hosted language exchange social events as a marketing intern at CO&CO Hokkaido, facilitating interactions between English and Japanese language learners. Surveyed students, collected more than 200 pieces of feedback, and translated the results into actionable marketing insights
