Joseph Kuo
******@*******.***
EDUCATION
Cornell University, Department of Computing and Information Science Ithaca, NY
Master of Professional Studies (MPS) – Information Science Expected Dec 2021
• GPA: 3.72
Cornell University, Department of Computing and Information Science Ithaca, NY
Bachelor of Arts, Information Science May 2020
• GPA: 3.66
DATA SCIENCE COMPETENCIES
Languages & Tools: Python, R, SQL, Jupyter, pandas, dplyr, scikit-learn, PySpark, Tableau, Java, HTML, CSS, JavaScript, Excel
Techniques: Regression, Classification, Decision Trees, Clustering, Neural Nets, Sentiment Analysis, Data Mining, Data Visualization, Data Analysis, Dimensionality Reduction, Boosting, Optimization, Dashboarding, Statistics, A/B Testing, Experimental Design, Qualitative Analysis, Data Structures, Object-Oriented Programming
RELEVANT EXPERIENCE AND PROJECTS
Institute for Information Industry Taipei, Taiwan
AI and Machine Learning Intern Summer 2018
Built an executable pipeline that detects thyroid cancer cells in a microscopic thyroid biopsy image, reducing a patient's thyroid cancer diagnosis time from 30 minutes to 5 minutes
• Developed an area-average down-sampling algorithm to transform a super high-resolution image (460k x 230k) into 24 sub-images (6973 x 2418 each)
• Improved image preprocessing by using K-Means clustering and K-Nearest Neighbors on pixel intensity distributions to propose regions that potentially contain thyroid cancer cells, reducing computational time by 60%
• Trained a Convolutional Neural Network using the Single Shot Multibox Detector (SSD) algorithm that detects thyroid cancer cells with 92.5% accuracy and 97.8% recall
Stock Market Sentiment Analysis
ML Co-Lead Fall 2020
Performed large-scale social media data analysis and used Natural Language Processing (NLP) methods to identify patterns and create sentiment metrics that contextualize market movement and enrich Cheddar News’ daily market reporting
• Built a Reddit scraper using Python to collect over 2 million posts from Reddit through the Pushshift API
• Cleaned and vectorized text into trainable features using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm
• Trained a Naïve Bayes classifier (84% accuracy and 93% recall) and a Logistic Regression Classifier (86% accuracy and 91% precision) that can predict whether a post is “bullish” or “bearish” on the stock market
• Performed sentiment analysis and established a set of sentiment metrics/indicators to track the stock market’s daily sentiment on Reddit and detect anomalies
• Visualized the relationship between stock price movement and sentiment trends on social media
• Presented to the CTO of Cheddar Inc. on the applicability and limitations of our system for their daily news reporting
Skin Care/Beauty Product Recommendation System
Co-Lead Spring 2019
Collaborated with a frontend team and developed a personalized skin care recommendation system allowing users to search for products while targeting multiple desired effects and satisfying various constraints
• Built a pipeline that uses Boolean Search to filter out data unrelated to a given input
• Vectorized relevant product descriptions and user reviews into an input feature matrix using TF-IDF
• Implemented the Singular Value Decomposition (SVD) algorithm to reduce the dimensionality of the input matrix, speeding up calculation of Cosine Similarity scores by 50%
Analysis of Data Science Topic Popularity on Stack Exchange Summer 2020
Explored and quantitatively analyzed a Stack Exchange dataset on the most popular Data Science topics and made recommendations on the type of content that a Data Science Education Company should make
• Used SQL to query 10k posts from 2019 from the Stack Exchange Online Database
• Cleaned and analyzed post-tag data with pandas and visualized trends of the most popular Data Science topics with seaborn
• Identified and proposed the 3 top-trending Data Science content areas based on users’ interests on Stack Exchange
Predictive Modeling for Titanic Survival Rate Summer 2020
Outlined a workflow for building and optimizing a Logistic Regression model, a KNN model, and a Random Forest model using pandas and scikit-learn based on the Titanic dataset from Kaggle
• Cleaned dataset through standardization, data imputation, and removal of irrelevant or duplicated data using pandas
• Visualized data distributions and engineered features through quantile binning and one-hot encoding based on passenger status and historical context; selected the best features using RFECV
• Tuned hyperparameters and optimized the Random Forest model using grid search and cross validation to achieve 82% accuracy
ADDITIONAL EXPERIENCE
• Organized and hosted language exchange social events as a marketing intern at CO&CO Hokkaido, facilitating interactions between English and Japanese language learners. Surveyed and collected feedback from more than 200 students and translated the results into actionable marketing insights