Data scientist

Location:

New York City, NY

Posted:

June 19, 2019

Resume:

CHANG Z

732-***-****

******.****@*****.***

PROFESSIONAL SUMMARY:

A Data Scientist professional with 7 years of progressive experience in Data Analysis/Visualization, Statistical Modeling, Machine Learning, and Deep learning. Excellent capability in collaboration, quick learning and adaptation.

Experience in Data mining with large datasets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive modeling, Data Visualization.

Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.

Good experience in using various Python libraries (NumPy, SciPy, Matplotlib, python-twitter, Pandas, MySQLdb for database connectivity, Beautiful Soup).

Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.

Sound understanding of Deep learning using CNN, RNN, ANN, reinforcement learning, transfer learning.

Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions in MySQL.

Worked on MongoDB database concepts such as locking, transactions, indexes, Sharding, replication, schema design.

Strong Experience in Big data technologies including Apache Spark, HDFS, Hive, MongoDB.

Hands-on experience with Git.

Experience in applying machine learning algorithms for a variety of programs.

Experienced in Agile methodologies, including Scrum stories and sprints, in a Python-based environment, along with data analytics and data wrangling.

Good working experience in processing large datasets with Spark using Python.

Extensive knowledge of Azure Data Lake and Azure Storage.

Experience in migration from heterogeneous sources including Oracle to MS SQL Server.

Hands on experience in design, management and visualization of databases using Oracle, MySQL and SQL Server.

Experienced in the Hadoop 2.x ecosystem and Apache Spark 2.x framework, including Hive, Pig, Sqoop, and PySpark.

Experience in Apache Spark, Kafka for Big Data Processing & Scala Functional programming.

Experience in manipulating large data sets with R packages like tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and visualizing the data using the lattice and ggplot2 packages.

Experience in dimensionality reduction using techniques like PCA and LDA.

Completed an intensive hands-on Data Analytics boot camp spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.

Experience in data analytics, predictive analysis like Classification, Regression, Recommender Systems.

Good exposure to Factor Analysis, Bagging and Boosting algorithms.

Experience in Descriptive Analysis Problems like Frequent Pattern Mining, Clustering, Outlier Detection.

Worked on Machine Learning algorithms like Classification and Regression with KNN Model, Decision Tree Model, Naïve Bayes Model, Logistic Regression, SVM Model and Latent Factor Model.

Hands-on experience with Python and libraries like NumPy, Pandas, Matplotlib, Seaborn, NLTK, Scikit-learn, and SciPy.

Expertise in using the TensorFlow package for machine learning and deep learning in Python.

Good knowledge of Microsoft Azure SQL, Machine Learning and HDInsight.

Good exposure to SAS analytics.

Good exposure to deep learning with TensorFlow in Python.

Good knowledge of Natural Language Processing (NLP) and Time Series Analysis and Forecasting using the ARIMA model in Python and R.

Good knowledge of Tableau and Power BI for interactive data visualizations.

In-depth understanding of NoSQL databases like MongoDB and HBase.

Experienced in Amazon Web Services (AWS) and Microsoft Azure, including AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, and Azure Data Lake. Very good experience and knowledge in provisioning virtual clusters in the AWS cloud, which includes services like EC2, S3, and EMR.

Experience and Knowledge in developing software using Java, C++ (Data Structures and Algorithms) technologies.

Good exposure in creating pivot tables and charts in Excel.

Experience in developing Custom Report and different types of Tabular Reports, Matrix Reports, Ad hoc reports and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).

Excellent Database administration (DBA) skills including user authorizations, Database creation, Tables, indexes and backup creation.

TECHNICAL SKILLS

Languages

Python, R, MATLAB, Java, C

Python Libraries

NumPy, SciPy, Pandas, Matplotlib, Scikit-learn, TensorFlow, Keras, OpenCV, GloVe, Seaborn, Beautiful Soup, ggplot2, MXNet, Caffe, H2O, PyTorch, Theano, Azure ML

Algorithms

Regression:

Linear/Nonlinear Regression, Logistic Regression

Clustering:

K-Means, K-Nearest Neighbors, Hierarchical Clustering

Classification:

Bayes Classifier, Support Vector Machine, Decision Tree, Random Forest, XGBoost, LightGBM, CatBoost, AdaBoost

Others:

Principal Component Analysis, Anomaly Detection, Recommendation System

NLP/Machine Learning/Deep Learning

Natural Language Processing:

LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis

Deep Learning:

Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM)

Cloud

Google Cloud Platform, AWS, Azure, Bluemix

Web Technologies

JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modeling Tools

Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies

Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases

SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools

MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools

Informatica PowerCenter, SSIS.

Version Control Tools

Git

BI Tools

Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating System

Windows, Linux, Unix, Macintosh, Red Hat

PROFESSIONAL EXPERIENCE

Florida Blue – New York, NY Feb 2019 – June 2019

Data Scientist

Project: Predictive Modeling for Patient Adherence to Treatment

Summary: Used machine learning models to predict the probability of a consumer’s adherence to the treatment plan. Results were used to help the business team reach out to customers at high risk of discontinuing the treatment plan and to improve the company's rating.

Responsibilities:

Identified data source through clinical knowledge, business sense, and comprehensive literature review.

Wrote SQL queries to build master dataset (>500 features) and preprocess data.

Used sophisticated feature engineering methods to transform diagnosis codes (ICD-10), procedural codes (CPT), drug codes (NDC, GPI), and pharmacy claim data.

Used natural language processing methods to extract keywords from clinical and electronic medical record data.

Used libraries including caret, dplyr, and tidyverse for exploratory data analysis and data transformation.

Experimented with zero imputation, single imputation, and multiple imputation to fill missing data.

Performed feature selection for dimensionality reduction using forward, backward, bidirectional, and best-subset methods, as well as information value and feature-importance ranking.

Built machine learning models using logistic regression, support vector machine, random forest, gradient boosting machine (GBM), and extreme gradient boosting (XGBoost) algorithms.

Used grid-search method for hyper-parameter tuning.

The resulting models achieved AUC-ROC over 0.8 and were delivered to production.

Anthem, Inc. – Wallingford, CT Jun 2017 – Jan 2019

Data Scientist

Project 1: Data Mining for Health Care Claim Severity Prediction

Summary: Built classification models to predict fraudulent claims by severity in real time, reducing execution time. Implemented the models as a predictive solution for finding fraudulent claims. Forwarded the high-risk claims for further investigation.

Responsibilities:

Wrote SQL queries to select data of interest and export it to CSV files. Performed large data reads and writes to and from CSV and Excel files using Python Pandas.

Performed exploratory data analysis (EDA) with Pandas, NumPy, and Matplotlib to find trends and clusters.

Performed data cleaning with Python. Filled missing values (data imputation) with mean/median/mode. Used backward and forward filling methods for time series data. Removed duplicated and mistaken information.

Performed data transformation by rescaling, normalizing, and log transforming variables.

Brainstormed new features (feature engineering) to improve prediction accuracy.

Analyzed data and performed data preparation by applying a historical model to the data set in Azure ML.

Performed class imbalance reduction with oversampling and 80-20 undersampling methods.

Used one-class support vector machine method to detect anomalies in health care data.

Applied supervised methods (XGBoost, LightGBM, AdaBoost, Random Forest, Naive Bayes), an unsupervised method (Autoencoders), and a hybrid method (PCA regression) to detect medical insurance fraud using Python Scikit-Learn, TensorFlow, and Keras.

Tuned hyperparameters with grid search method.

Evaluated precision and recall with F-Score and G-Measure. Used ANOVA to evaluate statistical significance.

Collaborated with Data Engineers and Software Developers to develop experiments and deploy solutions to production.

Automated and owned the end-to-end process of modeling and data visualization.

Created and published multiple dashboards and reports using Tableau.

Communicated and coordinated with other departments to collect business requirements.

Used Git 2.x for version control with the Data Engineer team and Data Scientist colleagues.

Used Agile methodology and the Scrum process for project development.

Project 2: Natural Language Processing for Fraud Detection

Summary: Executed topic modeling for finding different topics based on the notes made corresponding to the claimant’s claim. Attributed the resulting topics to classify fraud and non-fraud categories.

Responsibilities:

Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.

Performed data cleaning and feature selection using Classification and Regression Tree (CART) method.

Tackled highly imbalanced fraud dataset using undersampling, oversampling and SMOTE methods.

Generated topic features from unstructured text data with Word2vec and Latent Dirichlet Allocation (LDA) methods.

Compared one-hot encoding and mean encoding methods for categorical feature transformation.

Worked on text analytics, Naive Bayes, sentiment analysis, and word-cloud creation with PyTorch.

Planned, developed, and applied leading-edge analytic and quantitative tools and modeling techniques to help clients gain insights and improve decision-making.

Applied various machine learning and statistical learning methods in Python, including decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering.

Compared the performance of different classification models including Support Vector Machine, Random Forest, and Deep Neural Network with and without using topic features.

Tuned hyperparameters with grid search method based on cross validation error.

Evaluated model performance with accuracy, precision, recall, F1 score and AUC-ROC Curve.

Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management.

Capital One – Dallas, TX Jan 2016 – May 2017

Data Scientist

Project 1: Data Mining for Risk Assessment

Summary: Incorporated models built in Python and R into the business processes to assess the risk involved with a credit card transaction.

Responsibilities:

Wrote Pig Scripts to perform ETL procedures on the data in HDFS.

Supported MapReduce Programs running on the cluster.

Configured Hadoop cluster with Namenode and slaves and formatted HDFS.

Implemented Data Exploration to analyze patterns and to select features using Python NumPy, Pandas, and Matplotlib.

Generated new features with existing features to improve model performance.

Used cost-based sampling and K-means method to synthesize more fraudulent samples.

Generated IF-THEN rules with Decision Tree as risk scoring rules.

Built a Convolutional Neural Network (CNN) based fraud detection framework with TensorFlow and Keras.

Created reports and dashboards using D3.js and Tableau 9.x to explain and communicate data insights, significant features, and model scores to both technical and business teams.

Used Git 2.x for version control with the Data Engineer team and Data Scientist colleagues.

Used Agile methodology and the Scrum process for project development.

Project 2: Credit Card Customer Churn Prediction

Summary: Developed models that predict a customer’s propensity to churn, leveraging information related to shopping behaviors, demographics, payment frequency, credit history, etc.

Responsibilities:

Participated in Data Acquisition with Data Engineer team to extract historical data by using Hadoop MapReduce and HDFS.

Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after processing the data.

Worked on loading the data from MySQL to HBase where necessary using Sqoop.

Performed data enrichment jobs to handle missing values, normalize data, and select features using HiveQL.

Used undersampling, oversampling, and SMOTE methods to handle the imbalanced dataset and improve prediction accuracy.

Built Factor Analysis and Cluster Analysis models using Python Scikit-Learn to classify customers into different target groups.

Applied survival analysis to estimate the customer survival curve and calculate customer lifetime value.

Used the Classification and Regression Tree (CART) method to select important features.

Developed an ensemble system based on majority voting, with Multilayer Perceptron (MLP), Logistic Regression (LR), decision trees (J48), Random Forest (RF), Radial Basis Function (RBF), and Support Vector Machine (SVM) as the constituents, using Python Scikit-Learn, SciPy, TensorFlow and Keras.

Generated IF-THEN rules with Decision Tree, which act as predictors for “early warning” in churn modeling.

Project 3: Design of A/B Tests to Evaluate Recommendation System Design

Summary: Designed a series of A/B testing experiments to assess customers’ acceptance of a new recommendation system design.

Responsibilities:

Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.

Designed A/B tests, including customer funnels, metrics, and test statistics, to measure the business performance of the new recommendation system.

Utilized SQL, Excel and several marketing/web analytics tools (Google Analytics, AdWords) to complete business and marketing analysis and assessment.

Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.

Analyzed the partitioned and bucketed data and computed various metrics for reporting.

Using R’s dplyr and ggplot2 packages, performed extensive graphical visualization of the overall data, including pageviews, click-through rate/probability, and daily active users.

DoITT, City of New York – NY May 2014 – Dec 2016

Data Scientist

Responsibilities:

Supported MapReduce Programs running on the cluster.

Configured Hadoop cluster with Namenode and slaves and formatted HDFS.

Used Oozie workflow engine to run multiple Hive and Pig jobs.

Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Hadoop MapReduce and HDFS.

Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

Performed statistical modeling with ML to derive insights from data under the guidance of the Principal Data Scientist.

Performed data modeling with Pig, Hive, and Impala.

Ingested data with Sqoop and Flume.

Used SVN to commit changes to the main EMM application trunk.

Understood and implemented text mining concepts, graph processing, and semi-structured and unstructured data processing.

Worked with Ajax API calls to communicate with Hadoop through Impala Connection and SQL to render the required data through it. These API calls are similar to Microsoft Cognitive API calls.

Strong working knowledge of Cloudera and HDP ecosystem components.

Used ElasticSearch (Big Data) to retrieve data into the application as required.

Developed scalable machine learning solutions within a distributed computation framework (e.g. Hadoop, Spark, Storm etc.).

Analyzed the partitioned and bucketed data and computed various metrics for reporting.

Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.

Worked on loading the data from MySQL to HBase where necessary using Sqoop.

Developed Hive queries for Analysis across different banners.

Exported the result set from Hive to MySQL using Sqoop after processing the data.

Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.

Hands-on experience working with SequenceFile, Avro, and HAR file formats and compression.

Used Hive to partition and bucket data.

Experience in writing MapReduce programs with Java API to cleanse Structured and unstructured data.

Created HBase tables to store data in various formats coming from different portfolios.

Worked on improving performance of existing Pig and Hive Queries.

Involved in the full SDLC of a BI project, including data analysis, design, and development of the data warehouse environment.

Used Oracle Data Integrator Designer to develop processes for extracting, cleansing, transforming, integrating, and loading data into data warehouse database.

Developed and customized PL/SQL packages, procedures, functions, triggers and reports using Oracle SQL Developer.

Responsible for designing, developing and testing the ETL strategy to populate data from various source systems (flat files, Oracle).

Worked with the Business units to identify data quality rule requirements against identified anomalies.

Developed data mappings, joins, and validation queries, and addressed data queries raised by the project team in a timely manner.

Worked closely with the Business Analyst and interacted with business users to gather new business requirements and to accurately understand current requirements.

Whole Foods Market – Austin, TX May 2011 – Aug 2013

Data Analyst Co-op

Responsibilities:

Integrated data from multiple data sources or functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.

Leveraged SQL, Excel and Tableau to manipulate, analyze and present data.

Performed analyses of structured and unstructured data to solve multiple and/or complex business problems utilizing advanced statistical techniques and mathematical analyses.

Developed advanced models using multivariate regression, Logistic regression, Random forests, decision trees and clustering.

Used Pandas, NumPy, Seaborn, and Scikit-learn in Python for developing various machine learning algorithms.

Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data.

Worked with distributed computing technologies (Apache Spark, Hive).

Applied predictive analysis and statistical modeling techniques to analyze customer behavior, offer customized products, and reduce delinquency and default rates. This led to a fall in default rates from 5% to 2%.

Applied machine learning techniques to tap into new markets and new customers, and presented recommendations to top management, which resulted in a 5% increase in customer base and a 9% increase in customer portfolio.

Analyzed customer master data to identify prospective business, understand customers' business needs, build client relationships, and explore opportunities for cross-selling financial products. The share of customers using more than 6 products increased from 40% to 60%.

Collaborated with business partners to understand their problems and goals and to develop predictive models, statistical analyses, data reports and performance metrics.

Participated in the ongoing design and development of a consolidated data warehouse supporting key business metrics across the organization.

Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.

Developed dashboards and reports using Tableau.

EDUCATION

M.S., Statistics, University of Minnesota-Twin Cities, 09/2013 – 05/2014

B.S., Computer Science, University of Texas-Austin, 09/2009 – 08/2013
