
Data Scientist / NLP Engineer

Location: Chicago, IL
Posted: November 14, 2022


Professional Summary

Several years of experience in the Data Science and Machine Learning field, with demonstrated skill in statistical analysis, data analytics, data modeling, and creation of custom algorithms.

Adept with machine learning and neural networks using a variety of systems and methods.

Experienced in the AI application domains of NLP and Computer Vision.

Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction.

Develop, deploy, and maintain production NLP models for scalability.

Work with Natural Language Processing using NLTK, spaCy, and other modules to develop applications for automated customer responses.

Expert in Robotic Process Automation (RPA) for implementing Machine Learning models for NLP and Computer Vision.

Apply advanced statistical and predictive modeling techniques to build, maintain, and improve multiple real-time decision systems. Work closely with product managers, service development managers, and product development teams to productize the algorithms developed.

Experience in designing star schemas and snowflake schemas for Data Warehouse and ODS architectures.

Experience in designing stunning visualizations using Tableau and publishing and presenting dashboards and storylines on web and desktop platforms.

Hands-on experience working with Support Vector Machines (SVM), K Nearest Neighbors (KNN), Random Forests, Decision Trees, Bagging, Boosting (AdaBoost, Gradient Boost, XGBoost), Neural Networks (FNN, CNN, RNN, LSTM).

Experience with Public Cloud (Google Cloud, Amazon AWS, and/or Microsoft Azure).

Experience with knowledge databases and language ontologies.

Experience with data analysis methods such as data reporting, ad-hoc data reporting, graphs, scales, pivot tables, and OLAP reporting with Microsoft Excel, R Markdown, R Shiny, Python Markdown, and RStudio.

Discover patterns in data using algorithms and SQL queries, and validate findings in Python with TensorFlow using an experimental and iterative approach.

Think creatively and propose innovative ways to look at problems by applying data mining approaches to the available information.

Identify/create the appropriate algorithm to discover patterns and validate findings using an experimental and iterative approach.

Experience in working with relational databases (Teradata, Oracle) with advanced SQL programming skills.

In-depth knowledge of statistical procedures applied to supervised and unsupervised problems.

Basic-to-intermediate proficiency in SAS (Base SAS, Enterprise Guide, Enterprise Miner) and UNIX.

Track record of applying machine learning techniques to marketing and merchandising ideas.

Technical Skills

Data Science Specialties: Natural Language Processing, Machine Learning, Predictive Maintenance, Stochastic Analytics, Internet of Things (IoT) analytics, Social Analytics

Analytic Skills: Bayesian Analysis, Inference, Models, Segmentation, Clustering, Naïve Bayes Classification, Sentiment Analysis, Predictive Analytics, Regression Analysis, Linear models, Multivariate analysis, Stochastic Gradient Descent, Sampling methods, Forecasting

Data Query: Azure, Google BigQuery, Amazon Redshift, Kinesis, EMR, HDFS, RDBMS, MongoDB, HBase, Cassandra, and various SQL and NoSQL databases, data warehouses, and data lakes.

Languages: Python, R, C/C++, SQL, Java; command-line/shell scripting.

Version Control: GitHub, Git, SVN

IDEs: Jupyter Notebook, PyCharm, IntelliJ, Spyder, Eclipse

RPA: Robotic Process Automation implementations across several industries.

Deep Learning: Multi-Layer Perceptron, Machine Learning algorithms, Neural Networks, TensorFlow, Keras, PyTorch, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), LSTM

Python Packages: NumPy, Pandas, scikit-learn, TensorFlow, SciPy, Matplotlib, Seaborn, Plotly, NLTK, Scrapy, Gensim

Analytic Tools: Classification and Regression Trees (CART), H2O, Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, RNN, Linear and Logistic Regression

Analytic Languages and Scripts: Python, R, HiveQL, Spark, Spark MLlib, Spark SQL, Hadoop, Scala, Impala, MapReduce

Soft Skills: Deliver presentations and technical reports. Collaborate with stakeholders and cross-functional teams. Advise about how to leverage analytical insights. Develop analytical reports that directly address strategic goals.

Cloud Computing: AWS, GCP, Azure

Professional Experience

Senior Data Scientist (AI / NLP) - BearingPoint Consulting, Chicago, IL

01/2021 - Present

BearingPoint is one of the largest consulting firms in Data & Analytics, RPA, and several other technology areas. I work as a Data Scientist/NLP Engineer across several industries and services (RPA included). My latest project, for a car manufacturer, involved developing an NLP model to automate the categorization of chatbot conversations. The chatbot is used to diagnose client problems relating to their vehicles, and the conversations were classified into one of multiple categories according to NHTSA protocol. Additionally, I worked on other customer service and customer analytics use cases, such as creating hybrid recommendation systems and customer segmentation, and used different machine learning models for Marketing Analytics. All my models were deployed into production on AWS and GCP cloud environments, and some feed a Robotic Process Automation workflow.

Performed data profiling on available data sources to identify potentially useful sources for the proposed machine learning use cases.

Consulted with the Compliance Department to determine relevant use cases.

Applied Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and LSTM models.

Used NLP transformer models such as BERT, GPT, and ELMo.

Applied the Python packages NumPy, Matplotlib, Plotly, Pandas, SciPy, and Featuretools for data analytics, data cleaning, and feature engineering.

Worked on segmentation problems using K-Means clustering, DBSCAN, and hierarchical clustering.

Designed recommendation engines utilizing content-based and collaborative filtering.

Utilized Docker, AWS, Python, NoSQL, and Kubernetes.

Used Hadoop HBase with Spark via PySpark modules to retrieve data from a NoSQL database and perform data analysis.

Utilized NLTK and Gensim for NLP processes such as tokenization and for creating custom word embeddings.

Utilized TensorFlow for building neural network models.

Used bagging and boosting methods (XGBoost, Random Forest, etc.).

Utilized Docker to containerize the model for use in applications.

Deployed operational models to a RESTful API using the Python Flask package and Docker containers (a minimal sketch follows this list).

Created customized applications to make critical predictions, automate reasoning and decisions, and run optimization algorithms.

Developed advanced, easy-to-understand visualizations to map and simplify the analysis of heavily numeric data and reports.

Designed, implemented, and evaluated new models and rapid software prototypes to solve problems in machine learning and systems engineering.

Analyzed data to generate logic for new systems, procedures, and tests.

Implemented and evaluated artificial intelligence and machine learning algorithms and neural networks for diverse industries.

Improved performance of models through fine-tuning and data cleaning.

Applied new technologies to improve model performance and reduce the time needed to build machine learning models.

Practiced Agile approaches, including Extreme Programming, Test-Driven Development, and Agile Scrum.
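
To illustrate the Flask/Docker deployment pattern mentioned above, here is a minimal sketch of a prediction endpoint; the model file name, payload shape, and port are hypothetical placeholders rather than details from the project.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a pre-trained model once at startup (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[1.2, 3.4, 5.6]]} (assumed shape).
    payload = request.get_json()
    preds = model.predict(payload["features"])
    return jsonify({"predictions": preds.tolist()})

if __name__ == "__main__":
    # In production this would typically run under gunicorn inside a Docker container.
    app.run(host="0.0.0.0", port=5000)

In practice, the Dockerfile would copy the model artifact and app code into the image and expose the same port the Flask app binds to.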

Data Scientist - Elevance Health (Formerly, Anthem Inc), Indianapolis, IN

01/2019 – 12/2020

Anthem is one of the largest healthcare and hospital companies in the US. I served as a Data Scientist interacting with 3 Data Science development teams working on data extraction and analysis, UI development/optimization, and ML pipeline builds. My work involved interacting with 4 Data Scientists, 3 Software Engineers, 1 Project Manager, and 1 Head of Data Science. I developed end-to-end models for Optical Character Recognition (OCR) of scanned documents, fraud detection, and churn prediction, and also developed Time Series models for price prediction of services. I used Google Cloud Platform (GCP) for the deployment and maintenance of my models.

Extracted text from documents using OCR.

Applied cosine similarity and BERT to find relevant sections of text in documents.

Applied OCR to extract handwritten signatures and dates.

Utilized the Amazon Textract machine learning (ML) service to automatically extract data from scanned handwritten documents.

Generated regex patterns to collect text from relevant sections.

Utilized OpenCV to find page numbers and text coordinates.

Stored data on a local Hadoop cluster.

Led weekly presentations to business stakeholders to refine the output.

Used Jira for sprint planning and cards.

Used Bitbucket and Git for code management.

Built deep learning neural network models from scratch using GPU-accelerated libraries like PyTorch.

Employed the PyTorch, scikit-learn, and XGBoost libraries to build and evaluate the performance of different models.

Used Time Series Analysis with ARIMA, SARIMAX, LSTM, and RNN models.

Used containers and Kubernetes for model deployment.

Built CI/CD pipelines to automate the end-to-end process from raw data to data product.

Troubleshot machine learning models in Python (TensorFlow) using the pytest package to keep the pipeline moving.

Applied fuzzy-search algorithms to help locate records for relevant searches.

Designed deep learning models and used Amazon Web Services (AWS) EC2 for model training with TensorFlow.

Used gradient-boosted trees and random forests to create a benchmark for potential accuracy.

Utilized K-Means, Gaussian Mixture Models, DBSCAN, and Hierarchical Agglomerative clustering algorithms to discover patient groups.

Filled missing data using k-Nearest Neighbors (kNN) imputation in Python with scikit-learn (a minimal sketch follows this list).

Documented solutions and presented the results to stakeholders.

Utilized Python, Pandas, Seaborn, and Matplotlib to produce data visualizations highlighting a variety of metrics.
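
To illustrate the kNN imputation mentioned above, here is a minimal sketch using scikit-learn's KNNImputer; the column names and values are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical patient-style data with missing entries.
df = pd.DataFrame({
    "age":    [34, 41, np.nan, 29, 55],
    "visits": [2, np.nan, 5, 1, 3],
})

# Each missing value is filled using the values of the k nearest rows,
# with distance measured on the features observed for both rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)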

Data Scientist - MSNBC, New York, NY

04/2017 to 01/2019

MSNBC is a paid, news-focused American cable television channel that provides NBC News coverage as well as its own reporting and political commentary on current events. I worked on a Data Science team focused on reducing the time the company spent manually sorting information. The team worked on video summarization and built an NLP-based filter trained on the summaries and other sources such as Twitter feeds. This filter sorted the feeds into categories of Spam, Reviews, and News, as well as into good, bad, and neutral sentiment. Once the tweets were identified, they were further filtered into real or fake news, with the final output being only real news and its overall sentiment. I also created recommender systems based on news categories.

Classified thousands of articles and tweets to build a complete dataset for the model.

Worked on video summarization using deep learning models such as Convolutional Neural Networks (CNNs) and Deep Convolutional Neural Networks (DCNNs), including AlexNet and variations of ResNet and VGGNet.

Constructed an NLP-based filter utilizing embedding and LSTM layers in TensorFlow and Keras (a minimal sketch follows this list).

Produced classifications of whether a given text was news or fit into other categories of potential interest, such as spam.

Cleaned text to standardize input to the model and ensure consistent results.

Built functions to automatically remove symbols, hyperlinks, and emojis, and performed spell checking on the received text.

Built exception handling to treat potential edge cases of incorrect or unusable data being fed to the model in production.

Ran sentiment analysis on text and determined whether the text was overall positive, negative, or neutral.

Deployed solutions to a Flask app on a cloud-based service (AWS), to which future user applications connect via an API.

Tested and compared this solution to Amazon Comprehend, which achieved a slightly higher accuracy of 94.7%.

Performed stemming and lemmatization of text to remove superfluous components and make the resulting corpus as small as possible while retaining all important information.

Produced a bag-of-words representation built from scratch using the NLTK and TensorFlow packages for text processing and tokenization.

Handed the finalized model to Android, iOS, and web developers to create a user front end.
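
Here is a minimal sketch of the kind of embedding-plus-LSTM text classifier described above; the vocabulary size, sequence length, and three-class output are assumptions chosen to match the Spam/Reviews/News split, not recorded project settings.

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # tokens kept after tokenization (assumption)
MAX_LEN = 100        # padded sequence length (assumption)
NUM_CLASSES = 3      # e.g. Spam / Reviews / News

# Integer-encoded, padded token sequences in; class probabilities out.
model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()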

Data Scientist - Penske Automotive Group, Bloomfield Hills, MI

02/2016 to 04/2017

Penske Automotive Group is an international transportation services company that operates automotive and commercial truck dealerships principally in the United States, Canada, and Western Europe. I worked on a team tasked with developing a data science model to increase revenue per unit of lot space occupied by cars. The project was specific to the used vehicle segment of the business. I applied Machine Learning and Deep Learning to model the sale price of used cars, and performed Survival Analysis and Time Series Analysis to predict how long used cars would sit on a lot before being sold, as well as the cars' prices over time.

Interpreted statistical results to facilitate enhanced decision-making by stakeholders.

Modeled sale price data with Neural Networks using the Keras API for TensorFlow, Random Forests, and XGBoost (gradient-boosted decision trees).

Used ARIMA and SARIMA models for time series forecasting.

Applied Python to perform Survival Analysis on time until sale date using Cox Proportional Hazards (a minimal sketch follows this list).

Conducted statistical significance testing on factors in the Cox Proportional Hazards model.

Scraped and mined data on replacement part prices to assess the cost of damages.

Produced SQL tables/databases to store the part pricing data.

Classified the degree of damage in used car damage reports using Sentiment Analysis and Natural Language Processing (NLP).

Wrote SQL queries to merge data from multiple tables to obtain relationships between used car model data, damage data, sale price data, and time-until-sale data.

Utilized a Git versioning environment.

Provided actionable insights about which used cars to keep on the lot to maximize revenue from lot space by determining which cars provided the largest return per unit time on the lot.

Used Tableau to create visualizations of pricing and survival analysis.
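
The following is a minimal sketch of a Cox Proportional Hazards fit for time-until-sale; the lifelines library and every column name are assumptions for illustration, as the resume specifies only that Python was used.

import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical lot data: days on the lot, whether the car sold (event),
# and covariates such as mileage and asking price.
df = pd.DataFrame({
    "days_on_lot": [12, 45, 30, 90, 7],
    "sold":        [1, 1, 0, 1, 1],   # 0 = unsold so far (censored)
    "mileage":     [42000, 88000, 61000, 120000, 15000],
    "price":       [18500, 9900, 14200, 7500, 23000],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days_on_lot", event_col="sold")
cph.print_summary()   # hazard ratios and significance per covariate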

Education

Bachelor of Science - Business Data Analytics, Emporia University, KS

Specialized in Mathematics, Statistics, and Robotics & Automation with Machine Learning

Graduated with Honors

Certifications

Python Fundamentals

Machine Learning Fundamentals

Data Engineering Fundamentals

Cloud Computing Fundamentals

Data Analytics Fundamentals

TensorFlow 2.0 Fundamentals

Data Types for Data Science in Python

Data Manipulation with Pandas

Intermediate Python

Object-Oriented Programming in Python

Regular Expressions in Python

Web Scraping and Spiders in Python

Writing Efficient Python Code

Statistical Thinking in Python

Supervised Learning with Scikit-Learn

Unsupervised Learning with Python

Linear Classifiers in Python

Machine Learning with Tree-Based Models

Artificial Neural Networks in Python

Convolutional Neural Networks in Python

UiPath Advanced RPA Developer

Data Analytics in R

Advanced Data Analysis Nano Degree in Python

Research Paper / Publication

Chavarria, J., Flores, J., Mostafa, S., & Riedy, M. (2022). Who is ‘SLAPPing’ Whom? Mountain Plains Journal of Business and Technology, 23(1). https://openspaces.unk.edu/mpjbt/vol23/iss1/1


