Greensboro, North Carolina, United States
October 16, 2016

A senior data scientist with experience working across multiple data platforms.

Excellent hands-on experience in data collection and organization, data processing and analysis, and data visualization with experience working on numerous projects in a range of therapeutic areas.

Excellent record in scientific research and in method and application development, with extensive programming experience and the ability to turn ideas into innovations.

Successfully collaborated with scientists and management to develop and promote the use of tools and technology to improve operational efficiency.


Computer Languages: Java/Scala/Spark/Hadoop, Python, C#/C++/C, Matlab & R, SAS and PL/SQL/PostgreSQL

Software Skills: Tableau, Pipeline Pilot & KNIME, Spotfire

Data warehouse


Big data analysis

Machine learning


Data visualization

Statistical analysis

Data mining

Data management

Programming Achievement

Designed and implemented data extraction and collection from documents (XML and PDF) and e-health records

Initiated, developed and maintained the in-house molecular toolkit

Developed molecular modeling tools for automated data collection and validation, model building and model performance tracing.

Developed and implemented two proprietary non-linear methods that improved model accuracy.

Designed and implemented a novel diversity analysis method for high dimensional large data clustering, classification and visualization

Developed and implemented a library design and analysis modeling module.

Implemented a modeling package for data mining and data mapping.

Implemented hybrid variable selection-PLS and KNN

Combined the genetic algorithm, hill-climbing and simulated annealing with pattern recognition methods to develop a quantitative structure-property relationship package for data analysis and model building.

Developed and implemented a proprietary novel textual descriptor generation module.

Designed and implemented tools to support modeling group for model building, parameter adjustment and optimization, model performance assessment and data visualization.


WFUBMC, Winston-Salem, NC 2016-

Data Analyst (consultant)

Designed, coded and implemented Java programs for scheduling data extraction, transformation and loading to fulfill internal client requests.

Extracted, transformed and loaded data utilizing Python, Java and Spark platforms. Responsible for analyzing health data. Evaluated and tested the data solution modules and created data validation programs.

Designed, developed and implemented packages for extracting and collecting data from healthcare documents (PDF, XML and CSV) into a database to support medical research.

CUBIST (MERCK), Lexington, MA 2012-2015

Principal Scientist (Discovery Technologies)

Developed predictive antibacterial resistance and permeability models using a variety of machine learning techniques including random forests, Bayesian classification and principal component analysis. Deconvoluted generated models to identify good/bad chemical features in order to proactively guide the design of prospective targets.

Analyzed HTS data for a human disease project and provided follow-up support from hit confirmation and concentration response through lead identification and series selection, including data mining of in-house and external structure databases. Used the outcomes of data analysis to make recommendations and decisions in support of projects.

Worked closely with project teams and members, providing proactive input, including suggestions of compounds to synthesize and identification of key compounds for in vitro/in vivo testing based on multidimensional data analysis results, profiling, series selection and advancement. Focused on optimizing properties such as membrane penetration and resistance avoidance while maintaining target activity, in order to improve MICs against difficult strains using linear and nonlinear models (data aggregation, descriptor selection, dimensionality reduction and machine learning).

Designed project home pages and application tools to support project data retrieval, organization, visualization and SAR analysis.

Initiated a poly-pharmacology project targeting multiple ion channels for the treatment of post-operative pain and led the project through exploratory stages and follow-up, working closely with the biology co-lead.

TARGACEPT, Winston-Salem, NC 2002-2012

Senior Scientist (Drug Discovery and Development) 2007-2012

Scientist (Drug Discovery and Development) 2002-2007

Developed linear and non-linear predictive models using Random Forest, Self-Organizing Map, Partial Least Squares and k-Nearest Neighbor for recognition of potential nicotinic a4b2, a3b4, a6* and a7 agonists/antagonists and channel modulators.

Designed and implemented a novel diversity analysis method for high dimensional large data clustering, classification and visualization.

Utilized statistical methods to develop in-vitro/in-vivo translations used to identify sources of both activity and adverse effects, to better understand the efficacy of nicotinic ligands in pain/smoking cessation models, and to provide support for the selection process.

Initiated and automated a web-based structure management and prediction site for medicinal chemists. This concept led to the development of an in-house automated data-sharing system.

Developed and implemented a proprietary Multiple Target Prediction module with novel textual descriptors for routine prediction of various in-house assays as well as ADMET for all registered virtual structures and synthesized compounds.

Combined the genetic algorithm, hill-climbing and simulated annealing with pattern recognition methods to develop a quantitative structure-property relationship package for data analysis and model building.

Supervised contractors, postdoctoral researchers and interns, and supported the postdoctoral program.


Research Associate (Tropsha Lab: The Lab for Molecular Modeling, School of Pharmacy)

Developed and implemented modeling packages for library design, large multidimensional data diversity analysis, data mining and data mapping.

Collaborated with Rohm and Haas’s computational chemists to develop methodology for descriptor scaling and data analysis.

Developed ADME models using the SAS package and installed them on the GSK intranet for use as routine prediction tools for DMPK scientists.


PhD, Organic Chemistry (Computational Chemistry)

Nankai University, Tianjin, China

MS, Analytical Chemistry (Pattern Recognition in Chemistry)

Changchun Institute of Applied Chemistry, Academia Sinica, Changchun, China

BS, Chemistry

Jiangxi Normal University, Nanchang, China

Other Training

Data Science, Coursera, Johns Hopkins University

Statistics and R for the Life Sciences, Harvard, MA, edX

Introduction to Computational Thinking and Data Science, MITx, edX

