Data Scientist Data Analyst Machine Learning Model AWS Azure

Location:

Santa Clara, CA

Posted:

June 26, 2025

Resume:

YUNHONG YANG

617-***-**** **************@*****.*** LinkedIn Danville, CA (open to relocation)

EXPERIENCES

FASHION VISIONAIRE LLC - Fashion brand focused on e-commerce and retail partnerships

Product Data Analyst - E-commerce Analytics & Operations

New York, NY

2025

● Automated inventory updates and pricing logic by integrating Shopify with Mirakl and other retail platforms using API-based Python scripts, reducing manual updates by 40%.

● Analyzed campaign performance and user behavior using Google Analytics and Shopify metrics, guiding A/B testing and content optimizations that improved CTR and conversion rates.

● Prompt-engineered and fine-tuned GPT models via API to generate product descriptions, marketing copy, and internal tools, improving content creation speed and maintaining consistent brand voice.

● Delivered monthly performance reports for executive review, integrating KPIs across sales, inventory, and marketing via Tableau and custom JavaScript dashboards tailored for sales team tracking.

● Managed supply chain operations, including drop shipping logistics, EDI setups, and real-time inventory control across multiple channels, improving order accuracy and delivery timelines.

D.C. WITNESS - Non-profit organization that tracks violent crime data for judicial transparency

Data Scientist - Data Migration & Machine Learning

Washington, D.C.

2024

● Supported the journalism team with data migration and visualization needs, including extracting key statistics, creating dashboards, and debugging database issues to ensure accurate reporting.

● Developed machine learning models (Logistic Regression, Decision Trees) using AWS SageMaker to predict criminal case outcomes based on factors like time, location, and presiding judge, providing actionable insights and improving decision-making accuracy by 15% to support investigative journalism.

● Developed an end-to-end MySQL database and ETL pipeline for 10K+ crime records; automated data ingestion to replace manual Airtable entry, boosting efficiency and data access speed by 67%.

● Summarized 10,000+ PDFs using GPT-4 API, cutting processing time by 75% through prompt engineering and fine-tuning.

● Improved data-driven decision-making processes by 10% through advanced data visualization and reporting using Python libraries (Plotly, Folium), enabling faster and clearer crime trend insights for stakeholders.

HILDRETH INSTITUTE & DATAKIND - Non-profit organization that analyzed student loan

Data Scientist - NLP & Data Visualization

Washington, D.C.

2023 - 2024

● Analyzed data to surface insights for annual and research reports that guide policy recommendations, and created visualizations to support non-technical team members in data-driven decision-making.

● Built custom sentiment analysis pipelines for CFPB student loan complaints using TextBlob, identifying 3 high-impact timeframes with major emotional shifts.

● Applied A/B testing to evaluate policy shifts' emotional impact, influencing strategic reporting.

● Fine-tuned a BERT model for emotion recognition, streamlining future content labeling by automating the classification process, resulting in a 30% improvement in processing speed.

● Updated a Tableau dashboard based on senior leadership requirements to make complex data sets more accessible.

YIDUN PRIVATE EQUITY FUND - PE firm specializing in stock trading with $30 million AUM

Quantitative Researcher - Predictive Modeling & Time Series Analysis

Shenzhen, China

2023

● Facilitated the development of a predictive model for CSI 300 index movements, achieving 81%+ accuracy by applying Random Forest and Naive Bayes. Worked with the Quant team to incorporate model insights into trading strategies.

● Enhanced stock market forecasting precision by 0.2% using Random Forest, Naive Bayes, and Seasonal ARIMA models, and identified macroeconomic trends such as year-end real estate recovery, which informed portfolio allocation decisions.

● Automated intraday data ETL with BaoStock API, storing cleaned data in Parquet using multi-core GPU processing.

● Applied automated anomaly detection and historical cross-validation to maintain data integrity across 5,000+ stocks.

● Helped present model findings and market insights to cross-functional stakeholders, translating quantitative results into actionable investment insights alongside portfolio managers and analysts.

FUTUREWEI TECHNOLOGIES - Huawei's U.S. subsidiary for tech research and development

Data Analyst - Data Pipeline & HR Analytics

Framingham, MA

2021 - 2022

● Developed Python-based data pre-cleaning pipelines utilizing Pandas, NumPy, and SciPy, automating data validation processes to reduce processing time by 2 hours per week and improve data quality by 17.3%.

● Cleaned and organized data for 3,000+ interviewees across 10+ universities using Python and Pandas, improving data accuracy by 8.7%, enhancing predictive modeling reliability, and reducing manual application review time by 25%.

● Designed and developed interactive Tableau dashboards for HR metrics, integrating data from multiple sources to visualize key insights and empower HR leadership with real-time, data-driven decision-making capabilities.

PROJECTS & HACKATHONS

Empathy AI Coach: Emotional Support for Weight Management

Role: AI Engineer & Frontend Developer

Google's Women Techmakers presents She Builds AI

Nov 2024

● Developed a kind, patient, and motivational AI coach for weight management that provides both emotional support and practical guidance to women users.

● Prompted and fine-tuned the Gemini large language model based on user information and tailored prompt strategies, embedding it into an HTML framework via Gemini APIs.

● Developed the HTML front end with JavaScript to store conversation history within a single session, enabling context-aware, seamless conversations without resetting the model for each interaction.

● Contributed to the UI/UX design, crafting an intuitive chatbot interface that emphasized empathy and approachability to foster user trust and engagement.

Dating Relationship Analysis Using Reddit Archive Data

Sep 2023 - Dec 2023

● Processed and cleaned large datasets using PySpark, including a Comments dataset with 23 columns and 4.6M+ rows and a Submissions dataset with 70 columns and 378K+ rows.

● Applied machine learning models such as linear regression, random forest, and Support Vector Machine (SVM) to predict the popularity of Reddit posts based on various features.

● Implemented Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF) for topic modeling to summarize the comments and submissions on Reddit.

● Applied the Hugging Face Transformers library for sentiment analysis to extract emotions and analyzed the relationship between emotional changes and SPX stock data; extracted keywords with the sentiment analysis model to identify and categorize the most prominent words within each sentiment group, revealing the key drivers of public opinion.

Image Captioning - Unveiling Narratives Through Images

Sep 2023 - Dec 2023

● Developed and implemented an attention model for image captioning using the Person category of the COCO dataset.

● Integrated Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures, leveraging their respective strengths to effectively capture and highlight significant image features.

● Processed a substantial dataset of 10 images from COCO, employing data preprocessing techniques and augmentation methods to ensure robust model training.

● Engineered a model framework where CNN served as the Encoder and RNN as the Decoder, seamlessly integrating dynamic context vectors to refine the precision of generated captions.

● Evaluated model performance using metrics such as ROUGE score and Word Error Rate to assess caption quality.

EDUCATION

Georgetown University

M.S. Data Science & Analytics (GPA 3.97/4.00, DSAN Returning Student Scholarship, NLP TA)

Washington, D.C.

2022 - 2024

Boston University

B.A. Mathematics (GPA 3.54/4.00, Dean's List)

Boston, MA

2018 - 2021

ADDITIONAL EXPERIENCE

Technical: Programming & Tools (Python, R, SQL, Java, MATLAB, SAS, Snowflake, Power BI, Tableau), Big Data & Cloud Computing (AWS - SageMaker, Lambda, S3, RDS, Spark, PySpark, Hive, Azure, Google Cloud), Machine Learning Frameworks (TensorFlow, PyTorch, scikit-learn, Keras, Spark, YOLO), Python Libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, NLTK, spaCy, TextBlob, transformers), Data Management & Visualization (SQL, Data Quality Management, Unstructured Data)

Publications: Yang, Y., Agha, H., Akman, B., Kheloussi, D., & He, Q. Understanding Challenges Repaying Student Loans: An In-Depth Study Utilizing BERT Models for Emotion Recognition and Issue Classification. In The Digitized Campus: Artificial Intelligence and Big Data in Higher Education (Accepted). SUNY Press.