Josh Bacher
adgh2q@r.postjobfree.com
Python – 10 years
R – 10 years
SQL – 10 years
Data Modeling – 7 years
Neural Networks – 7 years
Senior data scientist specializing in machine learning and other techniques for enhancing business performance and extracting key insights.
EDUCATION:
Miami University
B.S. FINANCE
ECONOMICS MINOR
Summary
10 Years Data Science Experience
Self-taught data scientist with ten years of experience bringing robust knowledge of machine learning and statistical techniques to a wide range of business problems within the energy and financial sectors. A highly enthusiastic and effective individual who has applied these techniques to a diverse set of industry challenges, producing superb outcomes for multiple businesses.
Strong grounding in statistics and probability, including statistical modeling and hypothesis testing; solid track record executing machine learning projects
Familiarity with trends in relevant technologies and shifts in the data analytics climate
Strong leadership skills with specific experience in the Agile framework; excellent communication skills, both verbal and written
Experienced in taking machine learning models from experimentation to full deployment
Extensive experience with 3rd-party cloud resources: AWS, Google Cloud, and Azure
Developed neural network architectures from scratch, including convolutional neural networks (CNNs), LSTMs, and Transformers; also built unsupervised models such as k-means, Gaussian mixture models, and autoencoders
Proficient in supervised machine learning methods: Linear Regression, Logistic Regression, Support Vector Machines, Random Forests, Gradient Boosting, Survival Modeling
Proficient with the NumPy stack (NumPy, SciPy, Pandas, and Matplotlib) and scikit-learn
Proficient in TensorFlow and PyTorch for building, validating, testing, and deploying reliable deep learning algorithms for specific business challenges
Experience with ensemble techniques, including Bagging, Boosting, and Stacking; knowledge of Natural Language Processing (NLP) methods, in particular FastText, word2vec, and sentiment analysis
Technical Skills
Programming
Python, Spark, SQL, R, Git, bash
Libraries
NumPy, Pandas, SciPy, Scikit-Learn, TensorFlow, Keras, PyTorch, statsmodels, Prophet, lifelines, PyFlux, arch, FeatureTools, LIME
Version Control
GitHub, Git, BitBucket
IDE
PyCharm, Sublime, Atom, Jupyter Notebook, Spyder
Data Stores
Large data stores, both SQL and NoSQL; data warehouse, data lake, Hadoop HDFS, S3
RDBMS
SQL, MySQL, PL/SQL, T-SQL, PostgreSQL
NoSQL
Amazon Redshift, Amazon Web Services (AWS), Cassandra, MongoDB, MariaDB
Computer Vision
Convolutional Neural Network (CNN), Faster R-CNN, YOLO
Big Data Ecosystems
Hadoop (HBase, Hive, Pig, RHadoop, Spark, HDFS), Elasticsearch, Cloudera Impala
Cloud Data Systems
AWS (RDS, S3, EC2, Lambda), Azure, GCP
Data Visualization
Matplotlib, Seaborn, Plotly, Bokeh
NLP
NLTK, spaCy, Gensim, BERT, ELMo
Machine Learning
Supervised and unsupervised Learning algorithms
Machine Learning, Natural Language Processing, Deep Learning, Data Mining, Neural Networks
Linear Regression, Lasso and Ridge, Logistic Regression, Ensemble Classifiers (Bagging, Boosting and Voting), Ensemble Regressors, KNN, Naïve Bayes Classifier, Clustering (K-Means, GMMs, DBSCAN), PCA, SVD, ARIMA
Analytical Methods
Advanced Data Modeling, Regression Analysis, Predictive Analytics, Statistical Analysis (ANOVA, correlation analysis, t-tests and z-tests, descriptive statistics), Sentiment Analysis, Exploratory Data Analysis, Time Series Analysis (ARIMA) and Forecasting (TBATS, LSTM, ARCH, GARCH)
Principal Component Analysis (PCA) and SVD; Linear and Logistic Regression, Decision Trees and Random Forest
SENIOR DATA SCIENTIST
Apple Inc.
Cupertino, CA July 2020 – Present
Apple, one of the Big Tech companies, designs and develops consumer electronics, software, and online services. Worked with multiple data science teams in Apple’s Global Business Intelligence unit, where we forecasted loss/gain at different SSG levels and studied the impact of COVID-19 on the company’s reseller market.
Processed GeoTIFF files for spatial analysis using MATLAB & Python, plotting population densities and overlaying other socioeconomic data for regions across the United States and the world.
Built scraping modules using the Scrapy, BeautifulSoup, & requests libraries to extract Apple’s reseller data, including locations, discounts, & additional pricing data, along with associated demographic data within specific US regions.
Created scripts to load, concatenate, & clean multiple data files used by data science team members to analyze and forecast the future behavior of Apple’s vendors.
Applied several techniques for forecasting month-ahead loss/gain at each SSG level with Python, including ARIMA, Prophet, and LSTM.
Wrote production-level code to process new vendor data, which was fed into Tableau for data analysts to present.
Used Data Lab, Apple’s cloud platform, to train different time-series models for vendor forecasting.
Maintained version control using Apple’s Box & Quip platforms to synchronize code & data files with data science team members.
SENIOR DATA SCIENTIST
U.S Shell Oil Company
Houston, TX October 2017 – July 2020
Shell Oil Company is one of America’s largest oil and natural gas producers, natural gas marketers and petrochemical manufacturers. Worked with an internal data science team where we tested and built convolutional neural network architectures to analyze surveillance data from drones patrolling portions of the Permian Basin. We were able to tag security threats in near real-time by creating bounding boxes around objects in question.
Built the architecture and trained the convolutional neural networks (CNNs) as a team using the PyTorch Python API.
Leveraged transfer learning with custom-built classifiers in PyTorch to shorten time to production and improve results.
Fine-tuned ResNet-50, ResNet-101, and ResNet-152 models to adapt their pre-trained weights to our use case.
Used a pre-trained YOLOv3 model, a fully convolutional network (FCN), to speed up predictions.
Took into consideration prediction time and overhead to make sure our predictions happened in real time.
Augmented the training data by applying image transformations with Pillow, which helped regularize the model.
Worked with large stores of video imaging data stored on AWS S3 buckets for training the model.
Supplied our pickled model to the software development team to integrate into the drone pilot’s heads-up display (HUD).
Employed proper version control using git with BitBucket to coordinate with fellow team members.
Employed AWS SageMaker to explore object detection at a high level and to train my model before opting for a lower-level approach.
Replaced proprietary software with custom-built algorithms for greater control over the outcomes.
DATA SCIENTIST
Dominion Energy
Richmond, VA January 2014 – June 2017
Worked as a data scientist for a large American power and energy company, headquartered in Richmond, Virginia, that supplies electricity and natural gas to various states. Member of a small team of data scientists and analysts that created numerous demand forecasting models from Dominion’s historical data, hosted on Hadoop HDFS and Hive, to estimate short-term demand peaks for optimizing economic load dispatch.
Tested multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).
Built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to model the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.
Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.
Continuously validated models using a train-validate-test split to ensure forecasts were accurate enough to optimize the number of generation facilities needed to meet system load.
Prevented over-fitting with the use of a validation set while training.
Built a meta-model to ensemble the predictions of several different models.
Engineered time-series features using NumPy, Pandas, and FeatureTools.
Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.
Participated in daily standups in an Agile Kanban environment.
Queried Hive through Spark using the PySpark library in Python.
DATA SCIENTIST
Huntington National Bank
Columbus, OH December 2010 – December 2014
Huntington Bancshares, headquartered in Columbus, OH, is a full-service banking provider operating primarily in the Midwest. Was part of a small team of data scientists, data analysts, and credit analysts responsible for building and maintaining classification algorithms to determine loan approvals for prospective customers. With Huntington’s data hosted on Azure SQL Database, queried, tested, and analyzed various features to improve the algorithm’s predictive power.
Employed multiple machine learning models with Python, including simple Logistic Regression as baseline, Random Forest, and more complex boosting classifiers such as XGBoost.
Queried databases with SQLAlchemy in Python and loaded the results into Pandas DataFrames.
Routinely performed Stratified K-Fold Cross Validation on models for both model selection and model assessment.
Continually engineered and tested new features from data sources on Azure SQL Database to improve the models’ ability to capture signal.
Applied dplyr in R to manipulate data and engineer features.
Applied L1 regularization to aid in feature selection.
Applied the Synthetic Minority Over-sampling Technique (SMOTE) and undersampling to address the class imbalance in this binary classification task.
Wrote a custom loss function by applying class weights to a binary cross-entropy loss to further address the imbalanced data.
Explored the weights of the logistic regression model and the feature importance plots of the XGBoost model to explain and interpret results.
Visualized data and performed exploratory data analysis (EDA) with ggplot2 in R and Matplotlib/Seaborn in Python.
Worked with management to confirm data monetization proposals were in line with Huntington’s objectives.
Hosted the production model on an Azure Virtual Machine to serve a dashboard that aided credit risk analysts in their credit decisions.
PROPRIETARY STOCK & OPTION TRADER
First New York Securities
West Palm Beach, FL July 2005 – November 2010
First New York Securities is a multi-strategy investment firm headquartered in New York, NY, which implements strategies in equities, derivatives, fixed income, currencies, commodities, and futures across global markets. Was part of a team of traders trading equities and derivatives for the firm’s own account.
Generated $1.5 million in profits for the firm during the financial crisis (2008-2009) by adjusting the holding periods of positions.
Built mathematical models with varying time-scales using the StealthAlerts platform to filter securities of interest based on data analysis from previous successful trades.
Used pattern recognition to trade secondary offerings, increasing profits for the West Palm Beach office.
Mentored new and unproven traders on particular equities and trades, helping them build good habits and logical thought processes, which led to increased performance and profitability for the company.
Awarded luxury trips for remaining one of the most profitable traders at the firm.