
Data Civil Engineering

Atlanta, Georgia, United States
January 04, 2019



Donghua (Lyla) Cai

Phone: +1-716-***-****




Hands-on experience working with large structured and unstructured datasets, performing data cleaning and visualization, and applying machine learning techniques, mainly using Python, R, and SQL. Proficient at mastering new knowledge, techniques, and programming languages quickly.

EDUCATION

Georgia Institute of Technology, Atlanta, GA Expected: May 2019

M.S., Computational Science and Engineering & Civil Engineering

University at Buffalo, Buffalo, NY

M.S., Geographic Information Systems 2011 - 2014

Wuhan University, Wuhan, Hubei, China

B.S., Geographic Information Systems 2007 - 2011


Parkinson’s Disease (PD) Patient Classification and Prediction

Built a pipeline to extract, transform, and load the Parkinson’s Progression Markers Initiative (PPMI) dataset using Apache Spark with Scala.

Used Python scikit-learn to conduct feature selection, algorithm selection, cross-validation, and parameter tuning. Increased model accuracy of early-stage PD patient classification from 89% to 97%.

Make Creating Meetup Events Faster for Event Hosts

Built a web application in PHP with Google Maps to speed up meetup event creation for hosts by recommending venues and members based on the category and topic they enter.

Led and coordinated project members to scrape Meetup data on 150,000+ past events, 3,500 groups, and 30,000+ members across 50 cities in the greater Atlanta area using Python and the Meetup API, cleaned the data with OpenRefine, and built a database with SQL.

Applied PCA to reduce the dimensionality of the dataset and used the K-means algorithm from the Python scikit-learn module to cluster venues based on the interest topics of the groups hosting the events.

Received positive feedback from target users about the efficiency and user-friendliness of the web application.

Identify Fraud from Enron Email

Examined and cleaned 500,000+ emails from 146 employees of the Enron Corporation.

Optimized hyperparameters of Random Forest, Naive Bayes, and SVM algorithms using Python scikit-learn.

Improved the F1 score of the classifier from 0.36 to 0.7 by creating new features and selecting features with high variance.

Wrangling OpenStreetMap (OSM) Data

Performed data wrangling on unstructured data with Python. Parsed GBs of XML files, processed missing values, and audited the data quality of 4,220,611 nodes and 540,346 ways.

Reported that the most popular cuisine in the area is American food, and that only 0.25% of the streets have sidewalk information.

Proposed a leisure facility map by querying facility information in OSM data.

EXPERIENCE

Graduate Research Assistant, Georgia Tech Jan 2015 - Dec 2017

Implemented a numerical ocean modeling system in a Linux environment with TBs of scientific data to simulate coastal upwelling processes near a small-scale coastal promontory.

Analyzed and validated the numerical ocean model by comparison against field experiments using Matlab.

Published research papers in premier scientific journals.

SKILLS

Programming Languages: Python, Matlab, R, SQL, Scala, JavaScript, PHP, HTML, LaTeX

Tools: Apache Spark, Hadoop, Pig, Hive, Git

Visualization: matplotlib, ggplot, D3.js




Place Student Poster Presentation Award, the ELDAAG/CAGONT Annual Meeting. Oct. 2013
