
JOE Z. SUN

Phone: 929-***-****; Email: adxljx@r.postjobfree.com

Recognized as a high-performing professional with well-honed skills in this field, I consistently strive to meet or exceed the expectations and demands of the organization.

DATA SCIENTIST/ MACHINE LEARNING ENGINEER

Executive Snapshot

Analytically minded self-starter with 18 years of total IT experience, including 13+ years collaborating with cross-functional teams and ensuring the accuracy and integrity of data and actionable insights. Demonstrated excellence in Machine Learning and data mining on large structured and unstructured datasets, performing data acquisition, data pre-processing, exploratory data analysis (EDA), statistical analysis, data validation, predictive modeling, data visualization, NLP, and text analysis, transforming words and phrases from unstructured data into numerical features to create business insights.

Synopsis

Strong analytical skills with a background in NLP, Computer Vision, Statistical Machine Learning, Big Data, Cloud Computing, Predictive Analytics, Machine Learning deployment, and maintenance.

Excellent academic credentials: currently pursuing a certification in Academy Database Programming, backed by a Master's in Computer Science from Kennesaw State University - Atlanta, GA, and a BS in Healthcare Informatics from Mercer University - Atlanta, GA

Proficient in all Supervised Machine Learning techniques which include:

o Naïve Bayes Classifiers, K-Nearest Neighbors, Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines, Random Forests, Neural Network Architectures, XGBoost, and Survival Modeling

o NumPy stack (NumPy, SciPy, Pandas, and matplotlib) and scikit-learn.

o TensorFlow and PyTorch for building, validating, testing, and deploying reliable deep learning algorithms for specific business challenges

o Statistical models on large data sets using cloud computing services such as AWS, Azure, and GCP.

o Ensemble algorithm techniques, including Bagging, Boosting, and Stacking

Possess knowledge of Natural Language Processing (NLP) methods, in particular BERT, ELMo, word2vec, sentiment analysis, Named Entity Recognition, and Topic Modeling; Time Series Analysis with ARIMA, SARIMA, LSTMs, and RNNs

Well-versed in building unsupervised approaches such as K-Means, Gaussian Mixture Models, and Auto-Encoders. Program in R, Python, SQL, Spark, Scala, and MATLAB.

Efficiently handled visualizations using R-Programming (dplyr, ggplot2, plotly), Python (matplotlib, seaborn, plotly and Dash), and Tableau for end-user ad-hoc reporting.

Successfully designed custom BI reporting dashboards using Shiny, Shiny dashboard, and Plotly for providing actionable insights and data-driven solutions.

Expertise in working with and querying large data sets from big data stores using Hadoop Data Lakes, Data Warehouse, Amazon AWS, Cassandra, Redshift, Aurora, and other NoSQL and SQL data stores

Developed and tested hypotheses to improve the accuracy of predictions.

Increased accuracy of predictions by 10% by testing different hypotheses

Used hypothesis testing to identify what works and what doesn't, and made improvements accordingly

Experience with a variety of A/B testing tools and platforms

Designed and conducted statistically rigorous A/B tests

Conducted A/B tests to improve the effectiveness of marketing campaigns and website design

Proficient with the identification of patterns in data and using experimental and iterative approaches to validate findings.

Working alongside Data Engineers to build efficient ETL pipelines for algorithms and solutions. Possess advanced predictive modeling techniques to build, maintain, and improve real-time decision systems.

Experienced in employing Python, R, MATLAB, SAS, Tableau, and SQL for data cleaning, data visualization, risk analysis, and predictive analytics.

In-depth knowledge of statistical procedures that are applied in both Supervised and Unsupervised Machine Learning problems.

Developing different computer vision models for object classification and image recognition

Excellent communication skills (verbal and written) to communicate with clients, stakeholders, and team members.

Capability to quickly gain an understanding of niche subject matter domains, and design and implement effective novel solutions to be used by other subject matter experts.

Experience implementing industry-standard analytics within specific domains and applying data science techniques to expand these methods using Natural Language Processing, implementing clustering algorithms, and deriving insights.

Technical Skills

Deep Learning:

Recurrent Neural Networks, LSTM Networks, Artificial Neural Networks, Transfer Learning, Convolutional Neural Networks, Segmentation, Auto Encoding/Decoding

Programming Languages:

Python, R, MATLAB, Linux, LaTeX

Optimization Techniques:

Dynamic Programming, Convex Optimization, Non-Convex Optimization, Linear & Non-linear Programming, Monte Carlo Methods, Network Flows, Gradient Descent

Statistical Methods:

Bayesian Statistics, Hypothesis Testing, Factor Analysis, A/B testing, Stochastic Modeling, Factorial Design, ANOVA

Data Systems:

AWS (RDS, RedShift, Kinesis, EC2, EMR, S3), MS Azure, SQL, MySQL, NoSQL, Spark, Hive, Hadoop

IDEs:

Spyder, Jupyter, PyCharm, RStudio, Eclipse, Sublime, VSCode

Machine Learning Frameworks:

TensorFlow, PyTorch, Keras, Caffe

Unsupervised Learning:

Mixture Models, Hidden Markov Models, K-means Clustering, Hierarchical Clustering, Centroid Clustering, Principal Component Analysis, Singular Value Decomposition

Supervised Learning:

Naive Bayes Classifiers, Linear Regression, Logistic Regression, ElasticNet Regression, Multivariate Regression, Support Vector Machines, K-Nearest Neighbors, Decision Trees, Random Forests, Natural Language Processing, Time Series Analysis (ARIMA, GARCH, Prophet, RNNs), Survival Analysis (Poisson Regression, Cox Proportional Hazards, HMMs)

Python Libraries:

TensorFlow, NumPy, Pandas, SciPy, matplotlib, scikit-learn, Keras, PyTorch, PyBrain, Caffe, NLTK, StatsModels, Seaborn, Selenium, Flask, marshmallow, requests, SQLAlchemy, SQLite

Development Tools:

GitHub, Git, IPython notebook, Trello, Jira, SVN

Professional Experience

Since March 2022 with BitMart Exchange – New Jersey

As a Sr. Data Scientist/ Machine Learning Engineer

(BitMart is a company that delivers a global digital asset trading platform providing real-time trading services, including Bitcoin (BTC), Ethereum (ETH), and Tether (USDT) trading)

This Investment Portfolio project centers on Natural Language Processing and Time-Series Analysis for investment portfolio management. The project involves creating a data model that analyzes news for predictive analytics on stock trends over 6-month periods; this analysis is used to rebalance stock portfolios.
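
To illustrate the approach, here is a minimal sketch of the news-sentiment plus time-series idea in Python. It is not project code: the file names, column names, and the simple rebalancing rule are illustrative assumptions, using NLTK's VADER analyzer and a statsmodels ARIMA model.

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical inputs: daily headlines and closing prices for one stock.
news = pd.read_csv('headlines.csv', parse_dates=['date'])   # columns: date, headline
prices = pd.read_csv('prices.csv', parse_dates=['date'])    # columns: date, close

# Score each headline and aggregate into a daily sentiment signal.
sia = SentimentIntensityAnalyzer()
news['sentiment'] = news['headline'].map(lambda h: sia.polarity_scores(h)['compound'])
daily_sentiment = news.groupby('date')['sentiment'].mean()

# Fit a simple ARIMA model on daily returns and forecast ~6 months (126 trading days).
returns = prices.set_index('date')['close'].pct_change().dropna()
forecast = ARIMA(returns, order=(1, 0, 1)).fit().forecast(steps=126)

# Naive rebalancing signal: overweight when forecast returns and recent sentiment agree.
overweight = forecast.mean() > 0 and daily_sentiment.tail(30).mean() > 0
print('overweight' if overweight else 'underweight')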

Responsible for exploring various algorithmic trading theories and ideas.

Successfully applying data mining and optimization techniques in B2B and B2C industries, along with Machine Learning, Data/Text Mining, Statistical Analysis, and Predictive Modeling.

Aggregated and normalized data from various sources. Analyzed and visualized external market data and internal data.

Utilizing state-of-the-art predictive modeling methods.

Executing advanced machine learning algorithms utilizing Caffe, TensorFlow, Scala, Spark, MLlib, R, and other tools and languages as needed.

Created an Optical Character Recognition (OCR) process to automate handwritten document scanning and analysis with various computer vision and Natural Language Processing techniques (a minimal pipeline sketch appears at the end of this section).

Produced operations, financial, and market product analytics; interpreted data and analyzed results.

Created reporting visualizations to automate weekly and monthly reports.

Developed queries in requirements management tools using statistical techniques to create ad hoc and recurring reports

Conducted preliminary data exploration to assess modeling feasibility.

Involved in programming and scripting in R, Java, and Python.

Developing a data dictionary to create metadata reports for technical and business purposes.

Building reporting dashboard on the statistical models to identify and track key metrics and risk indicators.

Applying boosting methods to the predictive model to optimize its efficiency.

Extracting source data from Amazon Redshift on the AWS Cloud platform.

Parsing and manipulating raw, complex data streams to prepare for loading into an analytical tool.

Exploring different regression and ensemble models in machine learning to perform forecasting.

Developing new financial models and forecasts.

Improving efficiency and accuracy by evaluating models in R.

Utilizing TensorFlow to design deep learning models.

Applied Machine Learning (ML), deep learning (DL), and natural language processing (NLP) in Artificial Intelligence (AI) applications.

Defining source-to-target data mappings, business rules, and data definitions.

Performing end-to-end Informatica ETL Testing for custom tables by writing complex SQL Queries on the source database and comparing the results against the target database.

Worked well with stakeholders to accurately discern needs, advise on feasibility, and clarify requirements

Provided leadership, mentoring, and coaching to more junior team members.
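
The OCR pipeline mentioned above can be sketched as follows. This is a minimal illustration rather than project code, assuming OpenCV and pytesseract, with a placeholder file name.

import cv2
import pytesseract

# Load a scanned page and convert to grayscale.
image = cv2.imread('scanned_document.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Denoise and binarize so Tesseract sees clean, high-contrast glyphs.
gray = cv2.medianBlur(gray, 3)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Extract text; downstream NLP (entity extraction, classification) runs on this output.
text = pytesseract.image_to_string(binary)
print(text)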

Jan 2016 – Mar 2022 with First Advantage - Atlanta, GA

As a Data Scientist/ML

(I screen, you screen, we all screen with First Advantage. The company provides risk management services such as employment background screening, drug and alcohol testing, criminal record monitoring, and fingerprinting, among others. It helps clients find and retain qualified people quickly and efficiently. Its robust combination of searches and services provides insights that help reduce clients' risk while dramatically improving the quality of their talent and the reliability of renters. It also advises companies on how they can earn tax credits, training grants, and location-based incentives. The company has offices in the US, Europe, the Middle East, Africa, and the Asia/Pacific region)

Drug discovery is a complicated, multi-disciplinary process, often bounded by billions of tests and huge financial and time expenditures; it takes twelve years on average to get a drug officially submitted. Data science and machine learning algorithms streamline and shorten this process, adding perspective to each step, from the initial screening of drug compounds to the prediction of success rates based on biological factors. Such algorithms can forecast how a compound will act in the body using advanced mathematical modeling and simulation instead of lab experiments. The idea behind computational drug discovery is to create computer simulations of a biologically relevant network, simplifying the prediction of future outcomes with high accuracy. This allows choosing which experiments should be done and incorporates all new information in a continuous learning loop. Analogous techniques are used to predict the side effects of particular chemical combinations.
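
As a minimal illustration of this screening idea (not the production pipeline), the sketch below trains a random forest on a hypothetical table of pre-computed molecular descriptors with a binary activity label; the file and column names are assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical input: one row per compound, descriptor columns plus an 'active' label.
compounds = pd.read_csv('descriptors.csv')
X = compounds.drop(columns=['active'])
y = compounds['active']

# Random forest as a screening baseline, scored with cross-validated ROC-AUC.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f'mean CV ROC-AUC: {scores.mean():.3f}')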

Applied advanced analytics skills, with proficiency at integrating and preparing large, varied datasets, and communicating results.

Developed analytical approaches to strategic business decisions.

Performed analysis using predictive modeling, data/text mining, and statistical tools.

Collaborated cross-functionally to arrive at actionable insights.

Synthesized analytic results with business input to drive measurable change.

Assisted in the continual improvement of the AWS data lake environment.

Designed Architecture Diagram for the Feature Store to migrate to AWS

Identifying, gathering, and analyzing complex, multi-dimensional datasets utilizing a variety of tools.

Performed data visualization and developed presentation material utilizing Tableau.

Implemented various machine learning algorithms and statistical modeling techniques, such as Decision Trees, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression, and Linear Regression, using Python, and evaluated their performance.

Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; also performed Gap analysis.

Coordinated with other teams, such as the dashboard team, to provide model results and patient and medication information to populate the dashboard; also performed data integration tests

Implemented a CI/CD pipeline for deployment and MLOps in MS Azure

Responsible for defining the key business problems to be solved while developing, and maintaining relationships with stakeholders, SMEs, and cross-functional teams.

Used Agile approaches, including Extreme Programming, Test-Driven Development, and Agile Scrum.

Provided knowledge and understanding of current and emerging trends within the analytics industry

Participated in product redesigns and enhancements to understand how changes would be tracked and to suggest product direction based on data patterns.

Applied statistics and organized large datasets of both structured and unstructured data.

Applied algorithms, data structures, and performance optimization techniques.

Worked with applied statistics and/or applied mathematics.

Facilitated the data collection to analyze document data processes, scenarios, and information flow.

Determined data structures and their relationships in supporting business objectives and providing useful data visualizations in reports.

Promoted enterprise-wide business intelligence by enabling report access in SAS BI Portal and on Tableau Server.

Packages used: NumPy, scikit-learn, Pandas, Matplotlib, Seaborn, GitHub, PostgreSQL, SHAP, Apriori, doc2vec, Keras, Azure Databricks, MLflow

Jan 2014 – Jan 2016 with East West Bank, Pasadena, California

As a Machine Learning Engineer - MLops/ Data Scientist

(East West Banking Corporation operates as a commercial bank. The Bank offers deposits, savings accounts, loans, investments, credit cards, cash management, and other banking services)

I used a Seq2Seq model to let users run complex queries against convoluted tables and dashboards in natural language. My duties involved the creation and implementation of the system on EWB’s internal web portal.

Worked in Python and SQL on computer vision-based OCR models

Used the Python packages NumPy, Pandas, Seaborn, Pytesseract, OpenCV, and TensorFlow for this computer vision and NLP-based OCR problem

Engineered a very large, proprietary feature set based on the large pool of data collected by East West Bank

Responsible for pulling data from their enormous SQL database using MS SQL Server and presenting analyses with Tableau.

Developed Apache PySpark modules for retrieving data from the SQL database.

Built a tree-based model with XGBoost to classify cases as either suspicious or safe, enabling appropriate, timely action on all suspicious behavior.

Used Python's PyTorch package for building proposed Neural Network models which extrapolated better than the Tree baseline.

Unsupervised Gaussian Mixture Models helped to identify anomalous data points.

Used an ensemble method to combine the results of the supervised and unsupervised models (a sketch appears at the end of this section).

Model recall was required to stay above a threshold determined by domain experts.

Worked closely with their Fraud and Cyber Security departments to integrate model outputs into related operations.

Worked with a team of five other specialists and designed the Architecture Diagram for the Feature Store migration to AWS

Built the model on a cloud notebook platform.

Assisted in handing over the product for deployment on IBM mainframes with a pickled, serialized model.

Performed cross-validation and model selection using k-folds, train-validate-test, and information-theoretic criteria.

Cleaned and prepared raw data through Exploratory Data Analysis and Data Visualization using Pandas, NumPy, Seaborn, and matplotlib.

Rapidly created models in Python using Pandas, NumPy, scikit-learn, and Plotly for data visualization.

Implemented classification using supervised algorithms such as Logistic Regression, Decision Trees, KNN, and Naive Bayes.
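
A minimal sketch of the supervised-plus-unsupervised ensemble described above, using synthetic data; the features, blend weight, and threshold are illustrative assumptions, not the bank's actual model.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for engineered transaction features with rare 'suspicious' labels.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Supervised signal: probability a case is suspicious.
clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric='logloss')
clf.fit(X_train, y_train)
p_supervised = clf.predict_proba(X_test)[:, 1]

# Unsupervised signal: low likelihood under a GMM fit to normal (safe) behavior.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_train[y_train == 0])
anomaly = -gmm.score_samples(X_test)                                   # higher = more anomalous
anomaly = (anomaly - anomaly.min()) / (anomaly.max() - anomaly.min() + 1e-9)  # rescale to [0, 1]

# Simple blended score; the flagging threshold would be tuned to keep recall
# above the floor set by domain experts.
blended = 0.7 * p_supervised + 0.3 * anomaly
flags = blended > 0.5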

Jan 2012 – Dec 2013 with Alliance Data, Columbus, Ohio

As a Sr. Data Scientist

(Alliance Data is a consumer financial services company. Augmented credit decisions by incorporating machine learning approaches using Cox Proportional Hazards (CPH) with an XGBoost backend. Worked on a large, complex longitudinal dataset spanning 20 years of data stored in a SQL warehouse. Deployed the model on a Spark cluster. Our model was able to predict credit limits and reduce the probability of default; a sketch of the CPH-with-XGBoost idea appears at the end of this section).

This project focused on customer segmentation through machine learning and statistical modeling, including building predictive models and generating data products to support segmentation.

Applied analytics concepts of probability, distributions, and statistical inference to given datasets to unearth interesting findings through the use of comparisons, T-tests, F-tests, R-squared, P-values, etc.

Performed data mining using state-of-the-art methods and executed large efficient SQL queries.

Enhanced data collection procedures to include information relevant for building analytic systems.

Processed, cleaned, and verified the integrity of data used for analysis.

Conducted ad-hoc analysis and presentation of analytics in a clear manner.

Automated anomaly detection systems and constantly tracked performance.

Implemented data architecture and employed data modeling techniques.

Hands-on commercial data mining using tools built in R and Python.

Created machine learning algorithms using Scikit-learn and Pandas.

Built predictive models to forecast risks for credit and loan products.

Developed dashboards for use by executives for ad hoc reporting using the Tableau data visualization tool.

Developed clusters using information from the prospect database to enable marketing initiatives.
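
The CPH-with-XGBoost idea mentioned above can be sketched with xgboost's built-in survival:cox objective. The synthetic data and parameters below are illustrative assumptions, not the production model.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                # stand-in for borrower features
time = rng.exponential(scale=24, size=1000)   # months until default
censored = rng.random(1000) < 0.6             # accounts that never defaulted

# xgboost's Cox objective encodes right-censored observations as negative times.
label = np.where(censored, -time, time)
dtrain = xgb.DMatrix(X, label=label)

params = {'objective': 'survival:cox', 'eta': 0.1, 'max_depth': 4}
booster = xgb.train(params, dtrain, num_boost_round=100)

# Predictions are hazard ratios: higher values mean higher default risk.
risk = booster.predict(dtrain)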

Feb 2009 – Dec 2011 with Wegmans, Rochester, NY

As a Sr. Data Analyst

(Wegmans is a grocery store chain primarily located in the Northeast. Its produce division sells organic versions of most products in addition to the regular versions. I was tasked with devising a strategy to determine how much more customers are willing to pay for organic produce. My primary tool was a Linear Regression model to estimate how much more the average person would pay for the organic version of the same product).
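
The modeling itself was done in R; as an equivalent sketch of the idea, the Python/statsmodels snippet below estimates the organic price premium, with illustrative file and column names.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per transaction, with price, an organic flag, and the product.
sales = pd.read_csv('produce_sales.csv')   # columns: price, organic (0/1), product

# The coefficient on 'organic' estimates how much more shoppers pay for the organic
# version, controlling for which product it is via fixed effects.
model = smf.ols('price ~ organic + C(product)', data=sales).fit()
print(model.params['organic'], model.pvalues['organic'])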

Utilized statistical methods to analyze pricing and sales data and determine the value added by organic labeling on produce products.

Built linear regression models in R-Programming to determine statistically significant coefficients.

Outlined a prescriptive plan to improve sales profits by using more accurately targeted pricing.

Performed data visualization in the data exploration phase using ggplot2.

Utilized a Tobit Regression model to adjust the results for a high amount of censored data.

Presented my findings to stakeholders and decision makers to better inform future decisions.

Performed feature engineering to clean and process the data to feed to my model.

Used outside resources to supplement the data we had already gathered.

Jan 2005 – Jan 2009 with Warren Heating & Cooling, Milford, NJ

IT Software/ Web Development

(Warren Heating & Cooling is an HVAC service provider).

Worked on website upgrades.

Modified the website’s PHP framework.

Wrote new Java functions and modified existing ones.

Modified multiple scripts written in JavaScript.

Developed Web API functionality for data validation and back-end database communication using ASP.NET, C#, and SQL Server to support the development of front-end interfaces.

Assembled unit tests for a variety of Web API scenarios using Visual Studio’s testing components.

Academic Credentials

Pursuing certification in Academy Database Programming

Master's in computer science from Kennesaw State University - Atlanta, GA

BS in Healthcare Informatics from Mercer University - Atlanta, GA

Publications

Idiopathic Interstitial Pneumonias Medical Image Detection Using Deep Learning Techniques

https://dl.acm.org/citation.cfm?id=3314425 (April 2019)

Research Project

Utilized an algorithmic/programming toolkit to build a variety of machine learning models for data pattern classification. Built predictive models and performed data visualization. Implemented formal modeling processes end to end, including data gathering, data profiling, numerical model building, calibration, and cross-validation. Overcame the drawbacks of earlier algorithms using techniques such as Deep Learning, Data Visualization, and Convolutional Neural Networks.
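
In the spirit of the deep learning work cited above, here is a minimal Keras CNN sketch for binary medical image detection; the input shape and layer sizes are illustrative assumptions, not the published architecture.

from tensorflow import keras
from tensorflow.keras import layers

# Small binary classifier for grayscale scan-style images.
model = keras.Sequential([
    layers.Input(shape=(224, 224, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),   # probability the condition is present
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()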
