
Data Scientist

Location:
Chicago, IL
Posted:
April 04, 2023


Nicholas Kim

Data Scientist

P: 872-***-****

G: adonl3@r.postjobfree.com

PROFESSIONAL SUMMARY

Data Scientist with 7+ years of experience processing and analyzing data across a variety of industries. Leverages mathematical, statistical, and Machine Learning tools to collaboratively synthesize business insights and drive innovative solutions for productivity, efficiency, and revenue.

Skilled in developing algorithms and implementing novel approaches to non-trivial business problems in a timely, efficient manner; experienced with knowledge databases and language ontologies.

Experienced in executing solutions with common NLP frameworks and libraries in Python (NLTK, spaCy, Gensim) and Java (Stanford CoreNLP, NLP4J). Familiar with applying Bayesian techniques, advanced analytics, neural networks and deep neural networks, Support Vector Machines (SVMs), and Decision Trees with Random Forest ensembles.

Experience implementing industry standard analytics within specific domains and applying data science techniques to expand these methods using Natural Language Processing, implementing clustering algorithms, and deriving insight.

In-depth knowledge of statistical procedures that are applied in both supervised and unsupervised Machine Learning problems.

Experience applying statistical models on big data sets using cloud-based cluster computing assets with AWS, Azure, and other Unix-based architectures.

Well versed with Machine Learning techniques to promote marketing and merchandising ideas.

Proven creative thinker with a strong ability to devise and propose novel ways to look at and approach problems using a combination of business acumen and mathematical methods.

Skilled in identifying patterns in data and using experimental and iterative approaches to validate findings.

Proficient in using advanced predictive modeling techniques to build, maintain, and improve on real-time decision systems.

Contributed to advanced analytical teams to design, build, validate, and re-train models.

Possess knowledge of remote sensing; well versed in identifying/creating the appropriate algorithm to discover patterns and validating findings using an experimental and iterative approach.

Ability to quickly gain an understanding of niche subject matter domains, and design and implement effective novel solutions to be used by other subject matter experts.

Strong communication, interpersonal, and analytical skills, with the ability to multi-task, adapt, and manage risk in high-pressure environments; creative problem solver who thinks logically and pays close attention to detail.

TECHNICAL SKILLS

•Analytic Development: Python, R, Spark, SQL.

•Libraries: Pandas, NumPy, Matplotlib, Seaborn, sklearn, datetime, xgboost, scipy, statsmodels, regex, random, imblearn

•Frameworks: Azure, Anaconda, Jupyter, Google Colab, Spyder

•Tech/Tools: SharePoint, OneDrive, Excel, Word, PowerPoint, Outlook, Microsoft Teams

•Python Packages: NumPy, Pandas, Scikit-learn, TensorFlow, Keras, PyTorch, Fastai, SciPy, Matplotlib, Seaborn, Numba.

•Programming Tools: Jupyter, RStudio, GitHub, Git.

•Cloud Computing: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP)

•Machine Learning: Natural Language Processing & Understanding, Machine Intelligence, Machine Learning algorithms.

•Analysis Methods: Forecasting, Predictive, Statistical, Sentiment, Exploratory and Bayesian Analysis. Regression Analysis, Linear models, Multivariate analysis, Sampling methods, Clustering.

•Applied Data Science: Natural Language Processing, Machine Learning, Social Analytics, Predictive Maintenance, Chatbots, Interactive Dashboards.

•Artificial Intelligence: Classification and Regression Trees (CART), Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, Regression, Naïve Bayes.

•Natural Language Processing: Text analysis, classification, chatbots.

•Deep Learning: Machine Perception, Data Mining, Machine Learning, Neural Networks, TensorFlow, Keras, PyTorch, Transfer Learning.

•Data Modeling: Bayesian Analysis, Statistical Inference, Predictive Modeling, Stochastic Modeling, Linear Modeling, Behavioral Modeling, Probabilistic Modeling, Time-Series analysis.

•Soft Skills: Excellent communication and presentation skills. Ability to work well with stakeholders to discern needs. Leadership, mentoring.

•Other Programming Languages & Skills: APIs, C++, Java, Linux, Kubernetes, Back-End, Databases.

WORK EXPERIENCE

Execu Search, Chicago, IL October 2022 - Present

Senior Data Scientist

Project Synopsis: Provide general Data Science support: clean data, model data, maintain models, automate models, provide higher-level analysis, and make data/models ingestible for non-technical employees, etc.

Responsibilities:

Utilized both Jupyter notebooks (local) and Google Colab (cloud) as interactive notebook environments.

Used Anaconda in conjunction with Jupyter notebooks and Spyder to manage virtual environments for script development.

Used JSON and Pickle to save and load files such as requirements.txt, easily editable script presets, and serialized models.

Python for all code-writing related tasks.

Used Azure Machine Learning as a virtual environment for automation purposes.

Used SparkBeyond, a data analysis tool, for feature engineering.

Developed and automated a Lease Term model using an ensemble of various XGBoost classifier and regressor models.

Cleaned and prepped call-notes transcripts using NLP techniques and used Topic Modeling to develop an automatic topic-producing and summarizing tool for comprehensive quarterly call notes.

Used Python to clean lease data and build logic for whether a lease was renewed or the tenant relocated.

Used Random Forest, XGBoost, CatBoost, Linear Regression, Logistic Regression, Grid Search, Random Search, and Neural Networks to model the data once the renewal/relocation logic was developed.

Used Python to clean lease data and build logic for whether a tenant renewed its relationship with the real-estate agency or not (incumbency).

Used Random Forest, XGBoost, CatBoost, Linear Regression, Logistic Regression, Grid Search, Random Search, and Neural Networks to model the data once the incumbency logic was developed.

Utilized Leave-One-Out, One-vs-One, One-vs-All, and other classification strategies to identify the most situation-appropriate approach.

Utilized multiple methods of Cross Validation to ensure models weren’t overfitting or experiencing data leakage.

Extensive utilization of Microsoft Excel to manipulate data.

Daily utilization of Pandas, NumPy, sklearn, matplotlib, seaborn, and other libraries for data manipulation, graphing/plotting, modeling, and everything in between.

Incorporated surveys, games, and interactive tools into PowerPoint presentations for non-technical higher-ups.

Successfully presented the Lease Term model and fielded questions from audiences ranging in size from 1 to 200 people.

Engaged with leaders of multiple departments to collaborate on ideas and concepts with use cases benefiting both sides.

Worked on a team of 4 (3 data scientists and 1 data analyst); took the lead on several projects and was involved at some level in all of them.

Engaged in Agile Scrum with 4 team members and 2 managers; tasks were laid out and challenges consistently shared.
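The cross-validation strategies described above can be sketched as a hand-rolled K-fold index generator (illustrative only; real projects would typically use library implementations such as sklearn's KFold, and the fold counts here are hypothetical):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs covering k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        # Held-out fold is a contiguous slice; training set is everything else.
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if not (start <= i < start + size)]
        yield train_idx, test_idx
        start += size

# Every sample lands in exactly one test fold, so each model is scored
# on data it never saw during fitting.
folds = list(k_fold_indices(10, 3))
```

Scoring a model on each held-out fold and averaging gives a less optimistic estimate than a single split, which is what guards against overfitting and data leakage.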

Bank of America, Charlotte, NC February 2020 – October 2022

Senior Data Scientist

At Bank of America, I worked as a Natural Language Processing expert and model architect where I built, trained, and tested multiple Natural Language Processing models which classified user descriptions and wrote SQL code based on user questions. The goal of the project was to centralize and search for Splunk dashboards within the Bank of America network, and to create an A.I. assistant to automate the coding process to extract information from these dashboards.

•Used Python and SQL to collect, explore, and analyze the structured/unstructured data.

•Used Python, NLTK, and TensorFlow to tokenize, pad, and vectorize comments/tweets.

•Vectorized the documents using Bag of Words, TF-IDF, Word2Vec, and GloVe to compare their effect on each model's performance.

•Created and trained an Artificial Neural Network with TensorFlow on the tokenized documents/articles/SQL/user inputs.

•Performed Named Entity Recognition (NER) by utilizing ANNs, RNNs, LSTMs, and Transformers.

•Involved in model deployment using Flask with a REST API deployed on internal Bank of America systems.

•Wrote extensive SQL queries to extract data from the MySQL database hosted on Bank of America internal servers.

•Built a deep learning model for text classification and analysis.

•Performed classification on text data using NLP fundamental concepts including tokenization, stemming, lemmatization, and padding.

•Performed EDA using Pandas library in Python to inspect and clean the data.

•Visualized the data using matplotlib and seaborn.

•Explored word embedding techniques such as Word2Vec, GloVe, and BERT.

•Built an ETL pipeline that read data from multiple macros, processed it using custom preprocessing functions, and stored the processed data on a separate internal server.

•Automated ETL tasks and scheduling using custom data-pull functions.
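The TF-IDF vectorization step mentioned above can be sketched from first principles (a minimal illustration with hypothetical toy documents; production work would use library vectorizers):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for tokenized documents: one {term: weight} dict each.

    Uses relative term frequency and the classic idf = log(N / df),
    so a term appearing in every document gets weight 0.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        vectors.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

vectors = tf_idf([["data", "science", "data"], ["science", "rocks"]])
```

Terms shared across all documents ("science" here) score zero, while terms distinctive to one document carry the weight, which is what makes TF-IDF useful as input to the classifiers above.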

Dominion Energy, Richmond, VA June 2017 – February 2020

Data Scientist / ML Ops Engineer

Worked as a Data Scientist for a large American power and energy company headquartered in Richmond, Virginia that supplies electricity and natural gas to various states. Member of a small team of data scientists and analysts where we created numerous demand forecasting models from Dominion’s historical data hosted on Hadoop HDFS and Hive to estimate short-term demand peaks for optimizing economic load dispatch. Models were built using Time Series analysis using algorithms like ARIMA, SARIMA, ARIMAX, and Facebook Prophet.

•Explored multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).

•Successfully built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to model the uncertainty of Dominion's other time series, ensuring a 'safety' stock of generating units.

•Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.

•Continuously validated models using a train-validate-test split to ensure forecasts were sufficient to set the optimal number of generation facilities to meet system load.

•Prevented over-fitting with the use of a validation set while training.

•Built a meta-model to ensemble the predictions of several different models.

•Performed feature engineering with the use of NumPy, Pandas, and FeatureTools to engineer time-series features.

•Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.

•Participated in daily standups working in an Agile Kanban environment.

•Queried Hive by utilizing Spark with Python’s PySpark Library.
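Of the forecasting approaches listed above, the simplest, exponential smoothing, can be hand-rolled in a few lines (the series values are hypothetical; production models used ARIMA/Prophet-class libraries):

```python
def simple_exp_smooth(series, alpha):
    """One-step-ahead forecasts via simple exponential smoothing.

    Each forecast is the current smoothed level; the level is then
    updated toward the newly observed value with weight alpha.
    """
    level = series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level)                  # predict next point
        level = alpha * x + (1 - alpha) * level  # update the level
    forecasts.append(level)                      # forecast for the next unseen period
    return forecasts

# With alpha = 1 the forecast is just the previous observation (naive model);
# smaller alpha smooths out short-term noise in the demand signal.
demand = [10, 12, 14]
```

Comparing such one-step-ahead forecasts against a held-out tail of the series is the train-validate-test pattern described above.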

Cargill, Minneapolis, MN June 2015 – June 2017

Computer Vision Engineer

Cargill is an American privately held international food conglomerate; its major businesses are trading, purchasing, and distributing grain and other agricultural commodities. Our team used CNNs with Computer Vision to build a Machine Learning model that detected unhealthy hydrophytes. The model helped regulators work more efficiently by automatically detecting unhealthy hydrophytes in hydroponic farming, increasing the harvesting rate and, in turn, revenue.

•Performed statistical analysis and built statistical models in R and Python using various supervised and unsupervised Machine Learning algorithms like Regression, Decision Trees, Random Forests, Support Vector Machines, K-Means Clustering, and dimensionality reduction.

•Used MLlib, Spark's Machine Learning library, to build and evaluate different models.

•Defined list codes and code conversions between the source systems and the data mart, and kept the enterprise metadata library current with any changes or updates.

•Developed Ridge regression model to predict energy consumption of customers. Evaluated model using Mean Absolute Percent Error (MAPE).

•Developed and enhanced statistical models by leveraging best-in-class modeling techniques.

•Developed a predictive model and validated Neural Network Classification model for predicting the feature label.

•Implemented logistic regression to model customer default and identified factors that were good predictors.

•Designed a model to predict whether a customer would respond to a marketing campaign based on customer information.

•Developed Random Forest and logistic regression models for this classification. Fine-tuned models to prioritize recall over accuracy, managing the tradeoff between False Positives and False Negatives.

•Evaluated and optimized performance of models by tuning parameters with K-Fold Cross Validation.
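The ridge regression model evaluated with MAPE, mentioned above, can be illustrated with a minimal single-feature, closed-form sketch (the data and the no-intercept simplification are hypothetical; the real model used full feature matrices and library solvers):

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for y ≈ w * x (no intercept).

    Minimizes sum((y - w*x)^2) + lam * w^2, which gives
    w = sum(x*y) / (sum(x^2) + lam); lam shrinks w toward 0.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mape(actual, predicted):
    """Mean Absolute Percent Error, in percent."""
    return 100 * sum(abs(a - p) / abs(a)
                     for a, p in zip(actual, predicted)) / len(actual)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w = ridge_1d(xs, ys, lam=0.0)   # lam = 0 recovers the exact least-squares fit
```

Increasing `lam` trades a little bias (and hence MAPE on the training data) for lower variance, which is the point of ridge over plain regression.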

Hilton Hotels, McLean, VA April 2014 – June 2015

Data Analyst

Worked with NLP to classify text with data drawn from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set. One goal was to target users by automated classification, creating cohorts to improve marketing. The NLP text analysis monitored, tracked, and classified user discussion about products and/or services in online discussions. The Machine Learning classifier was trained to identify whether a cohort was a promoter or a detractor. Overall, the project improved marketing ROI and customer satisfaction. Also incorporated a Churn Analysis model to examine repeat business/drop-off.

•Worked the entire production cycle to extract and display metadata from various assets and helped develop a report display that was easy to grasp and gain insights from.

•Collaborated with both the Research and Engineering teams to productionize the application.

•Assisted various teams in bringing prototyped assets into production.

•Applied data mining techniques and optimization techniques standard to B2B and B2C industries, and applied Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.

•Utilized MapReduce/PySpark Python modules for Machine Learning and predictive analytics on AWS.

•Implemented assets and scripts for various projects using R, Java, and Python.

•Built sustainable rapport with senior leaders.

•Developed and maintained Data Dictionary to create metadata reports for technical and business purposes.

•Built and maintained dashboards and reporting based on the statistical models to identify and track key metrics and risk indicators.

•Kept up to date with latest NLP methodologies by reading 10 to 15 articles and whitepapers per week.

•Extracted source data from Oracle tables, MS SQL Server, sequential files, and Excel sheets.

•Parsed and manipulated raw, complex data streams to prepare for loading into an analytical tool.

•Involved in defining the source to target data mappings, business rules, and data definitions.

•Project environment was AWS and Linux.
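The MapReduce-style processing mentioned above can be sketched in plain Python as a mapper emitting (word, 1) pairs and a reducer summing per key (the toy review snippets are hypothetical):

```python
from collections import defaultdict

def map_phase(doc):
    """Emit (word, 1) pairs, mirroring a MapReduce mapper."""
    return [(word.lower(), 1) for word in doc.split()]

def reduce_phase(pairs):
    """Sum counts per key, mirroring a MapReduce reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["great stay", "great service great pool"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(pairs)
```

In PySpark the same shape appears as `flatMap` followed by `reduceByKey`; the value of the pattern is that both phases parallelize across a cluster.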

EDUCATION

Bachelor of Arts - Data Science - University of California, Berkeley


