
Data scientist who loves to dive into data

Atlanta, Georgia, United States
May 03, 2019



Donghua Cai

Address: Atlanta, GA

Phone: 716-***-****



Programming Languages: Python, R, SQL, Matlab, SAS, Scala, JavaScript, PHP, HTML, CSS, LaTeX

Tools: Excel, Apache Spark, Hadoop, Pig, Hive, Docker, ArcGIS, Git

Visualization: matplotlib, ggplot, D3.js, Tableau

PROJECTS JAN 2017 - PRESENT

Parkinson’s Disease (PD) Patient Classification and Prediction Jan - Apr 2018

Built a pipeline to extract, transform, and load Parkinson’s Progression Markers Initiative (PPMI) dataset using Apache Spark with Scala.

Used Python scikit-learn to conduct feature selection, algorithm selection, cross validation, and parameter tuning. Increased model accuracy of early-stage PD patient classification from 89% to 97%.

Make Creating Meetup Events Faster for Event Hosts Mar - Aug 2018

Built a web application in PHP with Google Maps to speed up the creation of Meetup events for event hosts by recommending venues and members based on their chosen category and topic.

Led and coordinated project members to scrape Meetup data covering 150,000+ past events, 3,500 groups, 50 cities, and 30,000+ members in the greater Atlanta area using Python with the Meetup API; cleaned the data with OpenRefine and built a database with SQL.

Applied PCA to reduce the dimensionality of the dataset and used the K-means algorithm in Python to cluster venues based on the interest topics of the groups hosting the events.

Received positive feedback from target users about the efficiency and user-friendliness of the web application.

Identify Fraud from Enron Email Feb - May 2017

Examined and cleaned 500,000+ emails from 146 employees of the Enron Corporation.

Optimized hyperparameters of Random Forest, Naive Bayes, and SVM algorithms using Python.

Improved the F1 score of the classifier from 0.36 to 0.7 by creating new features and selecting features with high variance.

Wrangling OpenStreetMap (OSM) Data Aug - Oct 2017

Performed data wrangling on unstructured data with Python. Parsed GBs of XML files, processed missing values, and audited the data quality of 4,220,611 nodes and 540,346 ways.

Reported that the most popular cuisine in the area is American food, and that only 0.25% of the streets have sidewalk information.

Proposed a leisure facility map by querying facility information in OSM data.

EXPERIENCE

Graduate Research Assistant, Georgia Tech Jan 2015 - Dec 2017

Implemented a Numerical Ocean Modeling system in a Linux environment with TBs of scientific data to simulate coastal upwelling processes near a small-scale coastal promontory.

Analyzed and validated the numerical ocean modeling results by comparison against field experiments using Matlab.

Published research papers in premier scientific journals.

EDUCATION

Georgia Institute of Technology, Atlanta, GA 2014 - May 2019

M.S., Computational Science and Engineering & Civil Engineering

University at Buffalo, Buffalo, NY

M.S., Geographic Information Systems 2011 - 2014

Wuhan University, Wuhan, Hubei, China

B.S., Geographic Information Systems 2007 - 2011




Place Student Poster Presentation Award, the ELDAAG/CAGONT Annual Meeting. Oct. 2013
