Machine learning, Data Scientist

Location:

Cupertino, CA

Posted:

November 23, 2020

Resume:

Josh Bacher

669-***-****

*************@*****.***

Python – 10 years

R – 10 years

SQL – 10 years

Data Modeling – 7 years

Neural Networks – 7 years

Senior data scientist specializing in machine learning and other techniques for enhancing business performance and extracting key insights.

EDUCATION:

Miami University

B.S. FINANCE

ECONOMICS MINOR

Summary

10 Years Data Science Experience

Self-taught data scientist with ten years of experience applying machine learning and statistical techniques to a wide range of business problems in the energy and financial sectors. A highly enthusiastic and effective individual who has applied these techniques to a diverse set of industry challenges, producing superb outcomes for multiple businesses.

Statistics and Probability, including statistical modelling, statistical hypothesis testing; sound performance executing machine learning projects

Familiarity with trends in relevant technologies and shifts in the data analytics climate

Strong leadership skills with specific experience in the Agile framework; excellent communication skills, both verbal and written

Competent taking machine learning from experimentation to full deployment

Extensive experience with 3rd-party cloud resources: AWS, Google Cloud, and Azure

Developed neural network architectures from scratch, such as convolutional neural networks (CNNs), LSTMs, and Transformers. Also built unsupervised approaches such as k-means, Gaussian mixture models, and autoencoders.

Proficient in supervised machine learning methods: Linear Regression, Logistic Regression, Support Vector Machines, Random Forests, Gradient Boosting, Survival Modeling

Experienced with the NumPy stack (NumPy, SciPy, Pandas, and Matplotlib) and scikit-learn

Proficient in TensorFlow and PyTorch for building, validating, testing, and deploying reliable deep learning algorithms for specific business challenges

Experience with ensemble techniques, including Bagging, Boosting, and Stacking; knowledge of Natural Language Processing (NLP) methods, in particular FastText, word2vec, and sentiment analysis

Technical Skills

Programming

Python, Spark, SQL, R, Git, bash

Libraries

NumPy, Pandas, SciPy, scikit-learn, TensorFlow, Keras, PyTorch, statsmodels, Prophet, lifelines, PyFlux, arch, FeatureTools, LIME

Version Control

GitHub, Git, BitBucket

IDE

PyCharm, Sublime, Atom, Jupyter Notebook, Spyder

Data Stores

Large Data Stores, both SQL and NoSQL, data warehouse, data lake, Hadoop HDFS, S3

RDBMS

SQL, MySQL, PL/SQL, T-SQL, PostgreSQL

NoSQL

Amazon Redshift, Amazon Web Services (AWS), Cassandra, MongoDB, MariaDB

Computer Vision

Convolutional Neural Network (CNN), Faster R-CNN, YOLO

Big Data Ecosystems

Hadoop (HBase, Hive, Pig, RHadoop, Spark, HDFS), Elasticsearch, Cloudera Impala

Cloud Data Systems

AWS (RDS, S3, EC2, Lambda), Azure, GCP

Data Visualization

Matplotlib, Seaborn, Plotly, Bokeh

NLP

NLTK, spaCy, Gensim, BERT, ELMo

Machine Learning

Supervised and unsupervised Learning algorithms

Machine Learning, Natural Language Processing, Deep Learning, Data Mining, Neural Networks

Linear Regression, Lasso and Ridge, Logistic Regression, Ensemble Classifiers (Bagging, Boosting and Voting), Ensemble Regressors, KNN, Naïve Bayes Classifier, Clustering (K-Means, GMMs, DBSCAN), PCA, SVD, ARIMA

Analytical Methods

Advanced Data Modeling, Regression Analysis, Predictive Analytics, Statistical Analysis (ANOVA, correlation analysis, t-tests and z-tests, descriptive statistics), Sentiment Analysis, Exploratory Data Analysis, Time Series Analysis (ARIMA) and Forecasting (TBATS, LSTM, ARCH, GARCH)

Principal Component Analysis (PCA) and SVD; Linear and Logistic Regression, Decision Trees and Random Forest

SENIOR DATA SCIENTIST

Apple Inc.

Cupertino, CA July 2020 – Present

Apple, one of the Big Tech companies, designs and develops consumer electronics, software, and online services. Worked with multiple data science teams under Apple’s Global Business Intelligence unit, where we forecasted the loss/gain at different SSG levels and studied the impact of COVID-19 on the company’s reseller market.

Manipulated GeoTIFF files for spatial analysis using MATLAB and Python to plot population densities and overlay other socioeconomic data for various regions across the United States and the world.

Built scraping modules using the Scrapy, BeautifulSoup, and requests libraries to extract Apple’s reseller data, including locations, discounts, and additional pricing data, along with associated demographic data within specific US regions.

Created various scripts to load, concatenate, and clean multiple data files used by data science team members for analyzing and forecasting the future behavior of Apple’s vendors.

Applied several techniques for forecasting month-ahead loss/gain at each SSG level with Python, including ARIMA, Prophet, and LSTM.

Wrote production-level code to process new vendor data that was fed into Tableau for data analysts to present.

Used Data Lab, Apple’s cloud platform, to train different time-series models for vendor forecasting.

Used Apple’s Box and Quip platforms to synchronize code and data files with data science team members.

SENIOR DATA SCIENTIST

U.S Shell Oil Company

Houston, TX October 2017 – July 2020

Shell Oil Company is one of America’s largest oil and natural gas producers, natural gas marketers and petrochemical manufacturers. Worked with an internal data science team where we tested and built convolutional neural network architectures to analyze surveillance data from drones patrolling portions of the Permian Basin. We were able to tag security threats in near real-time by creating bounding boxes around objects in question.

With the PyTorch Python API, the team built the architecture and trained the convolutional neural networks (CNN).

Applied transfer learning with custom-built classifiers in PyTorch to speed up production time and improve results.

Fine-tuned ResNet-50, ResNet-101, and ResNet-152 models to adapt their pre-trained weights to our use case.

Used a pre-trained YOLOv3 model, a fully convolutional network (FCN), to speed up predictions.

Took into consideration prediction time and overhead to make sure our predictions happened in real time.

Augmented the training images by applying transformations with Pillow, which helped regularize the model.

Worked with large stores of video imaging data stored on AWS S3 buckets for training the model.

Supplied our pickled model to the software development team to integrate into the drone pilot’s heads-up display (HUD).

Employed proper version control using git with BitBucket to coordinate with fellow team members.

Employed AWS Sagemaker to explore object detection at a high level and to train my model before opting for a lower level approach.

Replaced proprietary software with custom-built algorithms for greater control over the outcomes.

DATA SCIENTIST

Dominion Energy

Richmond, VA January 2014 – June 2017

Worked as a data scientist for a large American power and energy company headquartered in Richmond, Virginia, that supplies electricity and natural gas to various states. Member of a small team of data scientists and analysts that created numerous demand forecasting models from Dominion’s historical data hosted on Hadoop HDFS and Hive, to estimate short-term demand peaks for optimizing economic load dispatch.

Tried multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).

Built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to model the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.

Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.

Continually validated models using a train-validate-test split to ensure forecasts were accurate enough to optimize the number of generation facilities needed to meet system load.

Prevented over-fitting with the use of a validation set while training.

Built a meta-model to ensemble the predictions of several different models.

Engineered time-series features with NumPy, Pandas, and FeatureTools.

Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.

Participated in daily standups in an Agile Kanban environment.

Queried Hive through Spark using the PySpark library in Python.

DATA SCIENTIST

Huntington National Bank

Columbus, OH December 2010 – December 2014

Huntington Bancshares, headquartered in Columbus, OH, is a full-service banking provider operating primarily in the Midwest. Was part of a small team of data scientists, data analysts, and credit analysts responsible for building and maintaining classification algorithms to determine loan approvals from prospective customers. With Huntington's data hosted on Azure SQL Database, queried, tested, and analyzed various features to improve the algorithm’s predictive power.

Employed multiple machine learning models with Python, including simple Logistic Regression as baseline, Random Forest, and more complex boosting classifiers such as XGBoost.

Queried databases with SQLAlchemy in Python and loaded the results into Pandas DataFrames.

Routinely performed Stratified K-Fold Cross Validation on models for both model selection and model assessment.

Continually engineered and tested new features from data sources on Azure SQL Database to improve the models’ ability to capture signal.

Applied dplyr in R to manipulate data and engineer features.

Used L1 regularization to aid in feature selection.

Applied Synthetic Minority Over-sampling Technique (SMOTE) and undersampling to this imbalanced binary classification task.

Wrote a custom loss function by applying weights to a binary cross entropy loss to address the issue of imbalanced data.

Explored the weights of the logistic regression model and the feature importance plots of the XGBoost model to explain and interpret results.

Visualized data and performed exploratory data analysis (EDA) with ggplot2 in R and Matplotlib/Seaborn in Python.

Worked with management to confirm data monetization proposals were in line with Huntington’s objectives.

Hosted the production model on an Azure Virtual Machine to serve a dashboard that aided credit risk analysts in their credit decisions.

Proprietary Stock & Option Trader

First New York Securities

West Palm Beach, FL July 2005 – November 2010

First New York Securities is a multi-strategy investment firm headquartered in New York, NY, which implements strategies in equities, derivatives, fixed income, currencies, commodities, and futures across global markets. Was part of a team of traders trading equities and derivatives for the firm’s own account.

Generated $1.5 million in profits for the firm during the financial crisis (2008-2009) by adjusting the holding periods of positions.

Built mathematical models with varying time scales using the StealthAlerts platform to filter securities of interest based on data analysis of previous successful trades.

Used pattern recognition to trade secondary offerings, which increased profits for the West Palm Beach office.

Mentored new and unproven traders on particular equities and trades, helping them build good habits and logical thought processes, which led to increased performance and profitability for the company.

Awarded luxury trips for remaining one of the most profitable traders at the firm.
