HARSHUL SHAH
+1-469-***-**** ***********@*****.*** LinkedIn: in/harshulshah27 Portfolio Page
SUMMARY
Analytical and impact-driven Associate Data Scientist with experience applying machine learning, predictive modeling, and statistical techniques to solve real-world problems in sports, public policy, and product analytics. Adept in Python, SQL, and GCP, with hands-on experience in search optimization, recommendation systems, data mining, and ETL pipeline automation. Strong communicator and continuous learner with a passion for building insights that influence decision-making and enhance user experience. Passionate about leveraging data science to improve search relevance, customer experience, and operational efficiency in retail environments.
PROFESSIONAL EXPERIENCE
Data Analysis Intern, FC Dallas, Frisco, TX (MLS) October 2024 - May 2025
Developed a predictive ML tool to forecast youth player physical development, integrating GPS, match, and training datasets using Python and BigQuery.
Automated ETL workflows using Python & Google Apps Script to clean and sync performance data, reducing manual entry time by 50%.
Built dashboards (Tableau, Google Data Studio) to visualize player metrics, helping coaches make real-time strategic decisions.
Coordinated cross-functional data communication between performance analysts, coaches, and technical staff.
Data Engineering Intern, Gameplay Inc., California (Remote) May 2024 - August 2024
Engineered a cloud-automated ETL pipeline using Python, Databricks, GCP, and Gemini APIs, extracting 40,000+ facility records alongside satellite imagery to enhance spatial insights across U.S. sports complexes.
Collaboratively developed a PostgreSQL database to store, structure, and query facility attributes and image metadata, optimizing retrieval for downstream analytics.
Built Snowflake data models with anomaly detection logic, streamlining delivery to APIs and public dashboards.
Reduced data pipeline latency by 73%, enabling real-time spatial insights.
Intern, Ernst & Young, India February 2023 – June 2023
Conducted performance audits and automated SQL-based audit trails for public records, reducing reporting lag by 65%.
Collaborated with stakeholders to present findings in clear, visual formats tailored for government and non-technical audiences.
Product Development Intern, Cy5.io, India June 2022 – August 2022
Collaborated on developing an automated Software-as-a-Service tool that contributed to 25% of the company’s services, providing firewall-secured cloud data on Google Cloud Platform.
Reduced cloud services configuration time by 35% using Python-based automation on Google Cloud Platform (GCP).
Intern, MakeMyTrip, India January 2022 – February 2022
Developed Python scripts to identify released APIs that had not undergone vulnerability checks, reducing the number of APIs without VAPT coverage by 25% and enhancing data availability for analysis.
EDUCATION
Master of Science in Statistics & Data Science, The University of Texas at Dallas, TX May 2025
GPA: 3.11
BTech, Electronics & Communication with Computer Science, Jaypee Institute of Information Technology May 2023
GPA: 3.52
Awards: Academic Excellence (Highest GPA in batch)
TECHNICAL SKILLS
Programming: Python, SQL, R, SAS, C++, JavaScript
ETL & APIs: REST APIs, PDI workflows, Cloud Scheduler, Airflow (basic), Typeform
Machine Learning: Logistic Regression, Random Forest, SVM, Clustering, Time Series, GenAI, XGBoost, NLP (basic), Search and Recommendation Models
Tools & Platforms: Tableau, Power BI, Seaborn, Matplotlib, Google Data Studio, Streamlit
ETL & Data Tools: Airflow (basic), REST APIs, BigQuery, PostgreSQL, Snowflake, Azure DevOps, Git
Cloud & Infrastructure: Google Cloud Platform (GCP), Azure, Databricks, Gemini APIs
Soft Skills: Stakeholder Communication, Critical Thinking, Data Storytelling, Agile Collaboration
PROJECT EXPERIENCE
Scalable Movie Recommendation System Using Apache Spark (Master’s Project) Jan 2025 – May 2025
Developed a collaborative filtering recommendation model using PySpark on Databricks, processing 100K+ user-item interactions.
Designed distributed ETL pipelines with HDFS and Spark DataFrames for scalable data ingestion and transformation.
Tuned ALS model via Spark MLlib, achieving an 18% reduction in RMSE over baseline models.
Potential Soccer Player Forecasting Model October 2024 - December 2024
Engineered a player recommendation system using Random Forest and K-Means clustering to analyze performance data from the 2016 season.
Improved scouting speed by 30% by generating personalized player suggestions, reducing manual search time for clubs and scouting analysts.
Fantasy Premier League Predictor Model May 2024 - Present
Trained and tuned ML models (Random Forest, Logistic Regression) using 100+ gameweeks of player performance data across 500+ athletes in Python and scikit-learn.
Boosted prediction accuracy by 20%, leveraging time-series smoothing, fixture difficulty weighting, and trend factors.
Built an automated scoring system and data visualizations in Seaborn and Matplotlib to provide weekly player performance summaries with 24-hour turnaround.
Soil Of Worth (SOW) August 2022 – May 2023
Designed and trained a Random Forest-based recommendation engine to identify optimal crops for specific soil and weather conditions, improving planning accuracy by 30%.
Cleaned and standardized large-scale agronomic datasets, handling missing values, outliers, and schema inconsistencies to boost model precision by 25%.
Performed feature engineering on multi-dimensional environmental data and integrated geospatial insights to personalize recommendations across soil types.