Data scientist

Location:

Richardson, TX

Salary:

$95/hour

Posted:

November 14, 2020


Resume:

Luis Trevino

Data Scientist

Phone: 832-***-****

Email: **************@*****.***

Technical Skills

Data Science

Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Neural Networks, Machine Perception

Programming Languages

Java, Python, R, MATLAB

Frameworks

Spark, Spark Streaming, Kinesis, Kafka, Hive, Pig, Sqoop, MapReduce, Storm, Ambari

Data Analytics Methods

Advanced Data Modeling, Forecasting Models, Regression Analysis, Predictive Analytics, Statistical Analysis, Sentiment Analysis, Stochastic Optimization, Hypothesis Testing, Time Series Analysis, A/B Testing, Clustering

Database/Datastores

HDFS, RDBMS, Data Lake, Data Warehouse, Amazon Redshift, Cassandra, Aurora, SQL, PostgreSQL

Machine Learning

TensorFlow, Keras, Theano, Caffe, Gluon, MXNet, PyTorch, Deeplearning4j, CNTK

Data Query

Azure, Google, Amazon Redshift, Kinesis, EMR; HDFS, RDBMS, data lakes, and various SQL and NoSQL databases and data warehouses.

Project Management

Agile Scrum

Database

Oracle, RDBMS, SQL, NoSQL, Cassandra, Redshift, Aurora, MongoDB

Visualization

R, Tableau, ggplot2, dygraphs

Years of Experience

•7 years Data Science

•7 years Data Analysis

•5 years Software Engineering

Professional Summary

Proficiency in Predictive Modeling, Text mining and Machine Learning

Proficient in applying Statistical Modeling and Machine Learning techniques (Linear, Logistic, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian) in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, PCA, Ensembles

Experience in Text Mining and good knowledge of NLP components such as Natural Language Understanding (NLU) and Natural Language Generation (NLG).

Knowledge in Hadoop Core Components (HDFS, MapReduce) and Hadoop Ecosystem (Sqoop, Hive, Pig)

Experience in Univariate and Multivariate Analysis, model testing, problem analysis, model comparison and validation, ANOVA, and Regression Analysis.

Expertise in writing complex SQL queries to obtain filtered data for analysis

Working knowledge in implementing tree-based models such as Boosting, Random Forest.

Experience in using Model Pipelines to automate the tasks and put models into production quickly

Experience in using Tableau to create dashboards and tell clear data stories.

Worked with various Python libraries such as NumPy and SciPy for mathematical calculations, Pandas for data preprocessing/wrangling, Matplotlib and Seaborn for data visualization, scikit-learn for machine learning or deep learning, and NLTK for NLP.

Developed, tested, and implemented high-performance algorithms for precision targeting.

Understanding of data structures and algorithms

Experience with Python, R, Scala, and Spark

Development of artificial neural networks (ANN) and convolutional neural networks (CNN)

Ability to initiate and drive projects from conception to completion with minimal guidance and deliver superior performance

Developed statistical models in R and Python using various supervised and unsupervised machine learning algorithms such as Linear Regression and Logistic Regression, Classification, Decision Trees, Gradient Boosting, KNN, Support Vector Machines, Naive Bayes, K-Means Clustering, Neural Networks, Principal Component Analysis and Recommender Systems on structured and unstructured data.

Performed exploratory analysis and pre-processing on collected data and created statistical models whose conclusions guide policy-making decisions.

Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions in MySQL.

Data analysis using analytic tools: Anaconda, Jupyter Notebook, R, ggplot2, dplyr, car, MASS, lme4, MATLAB, etc.

Professional Experience

Data Scientist

July 2020 – Present

Verizon Richardson, TX

Verizon is a major telecommunications provider. I helped implement a multi-faceted project. Led and developed a recommendation engine for B2B packages consisting of phone devices, share plans, accessories, and features. Consulted with the onsite team and co-led fraud detection for orders placed on the B2B business domain, detecting fraudulent transactions before orders shipped.

Led meetings with the Verizon team to understand the data model and the data available for each use case

Worked with Oracle Database for EDA

Directed the onshore and offshore (India) teams, keeping them in sync through daily meetings

Flattened deeply nested JSON files to extract critical data

Communicated constantly with data engineering team to provide data science team with required data

Worked with diverse and large team of 11 people

Ran data science experiments to propose candidate models for use cases

Worked with Power BI for demo presentations

Discussed with clients and project owners on a weekly basis

Reported initial EDA on Jupyter notebooks

Extensive use of Python for code development

Reached goals within tight deadlines

Led development of machine learning models for onsite team

Used Random Forest and Logistic Regression for fraud detection

Considered content-based and rule-based models for project recommendation

Evaluated the models with performance metrics

Coordinated the handoff of work between the development and production teams

Carefully worked with data ensuring proper security procedures (VPN, PII data protocol)

Used SQL extensively to generate analyses and reports

Used Spark to productionize models

Implemented version control of code with GitHub

Increased sales of accessories for project recommendation

Reduced detections of fraudulent transactions by 56%

Used A/B testing to evaluate package recommendations from live models

Considered several sampling techniques in the initial phase of the project, when the production environment was limited

Data Scientist

April 2019 – June 2020

Avangrid New Haven, CT

Avangrid is a power and utilities provider. This project aimed to provide three major insights: (1) detect which smart meters were connected to which transformers, since some connections were recorded incorrectly in the GIS database that maps smart meters to transformers; (2) perform short-term load forecasting on the distribution transformers; and (3) perform life deterioration analysis to determine which transformers are susceptible to failure and how long until they fail. Neural networks (LSTMs and ANNs) were used for two of the insights, making extensive use of the DeepLearning4j deep learning library. We delivered what our customers requested in a timely manner.

Worked in a Windows environment and communicated with project on

Wrote the project application in Scala using IntelliJ

Packaged a Maven project into a JAR file to create our application and load any code dependencies

Worked on a 6-month project under high pressure to deliver to our customers by the end of the year

Used Zeppelin notebooks to perform exploratory data analysis and gain a better understanding of our data with descriptive statistics and visualizations

Used long short-term memory (LSTM) RNNs to predict the electrical load of distribution transformers

Used statistical analysis such as Pearson correlation to determine the transformer connections to smart meters in residential homes

Used multi-task deep neural networks to determine the life expectancy of distribution transformers

Used Jira for agile project management to coordinate project sprints and complete team goals in a timely manner

Used Spark for distributed computation to reduce the time needed to train deep neural networks

Used the DeepLearning4j deep learning library to create deep learning models such as LSTMs and Multi-Task Feed Forward Neural Networks

Talked to clients to understand their business needs and verify that our solutions met them

Mentored new data science employees joining the team and provided an appropriate data science workflow to the team

Held daily meetings to give updates on our progress and ensure proper team communication

Used Git Bash for version control and to push code to the client's GitHub repository

Used DeepLearning4j's classes to train the model's weights and verify that they were properly trained with its user interface visualization functionality.

Worked with spark-submit shell script to manage spark projects and applications

Used Recurrent Neural Networks to perform time series modeling and forecasting

Used Spark SQL to write queries and extract useful information from the data

Worked in the Spark ecosystem to perform SQL queries, deep learning training, and use the machine learning library

Deployed the models using the DeepLearning4j model serializer class to save the trained models

Wrote documentation for the data science team explaining in detail our work

Led weekly meetings with the data science team and business clients to discuss our progress

Used Citrix for network authentication and enterprise app access

Data Scientist

Sept 2017 – April 2019

Monsanto St. Louis, MO

Worked with large data from big data systems in various data stores such as Hadoop HDFS and Amazon Redshift platform.

Used Python, R, and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, and Decision Trees for estimations

Used Keras and TensorFlow to develop predictive algorithms.

Handled categorical variables with more than 100 levels and created dummy variables for these successfully.

Utilized Advanced Regression Modelling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.

Applied Predictive Modelling, Data Mining Methods, Factor Analysis, ANOVA, and Hypothesis testing.

Tried Support Vector Machine and K Nearest Neighbor Classification algorithms.

Analyzed and prepared data and identified patterns in the dataset by applying historical models.

Designed and built a production-ready logistic regression model.

Worked extensively on Feature Engineering and Exploratory Data Analysis.

Used R and Python for Exploratory Data Analysis, ANOVA tests, and Hypothesis testing.

Worked through the full software development life cycle (SDLC) using Agile and Scrum methodologies.

Created dashboards that helped market analysts spot emerging trends by comparing historical metrics.

Data Scientist

Feb 2016 - Sept 2017

HSBC New York, NY

Development of targeting, activation, incremental sales and look-alike models to increase profit by recommending financial solutions

Used ROC curves and AUC for feature selection.

Defined, designed, and documented conceptual and logical data models.

Development of clusters using information from prospect database for enabling marketing initiatives.

Profiled data to validate data quality issues for the critical data elements.

Developed clear definitions for data elements.

Promoted data changes and consistent data naming standards across the Enterprise.

Prepared source-to-target data mapping documents.

Validated target table structures and constraints against ETL requirements.

Validated target data against source data based on ETL requirements.

Involved in test data preparation.

Report & Dashboard testing against target tables using SQL queries.

Worked with scikit-learn feature tools for data prep, cleansing and feature engineering

Creation of a target operating model that balances the modelling rigor and product specific priorities and nuances.

Development of dashboards and models to support underwriting of various card products for international markets with varying credit bureau maturity.

marketing initiatives for credit card portfolio.

Data mining using state-of-the-art methods.

Enhanced data collection procedures to include information that is relevant for building analytic systems.

Data Scientist

Feb 2015 - Feb 2016

Scotiabank New Orleans, LA

Supervised, Unsupervised, Semi-Supervised classification and clustering of documents

Used key indicators in Python and machine learning concepts like regression, bootstrap aggregation (bagging), and Random Forest.

sampling using predictive models enabling improved error detection and cost-efficient sampling.

Developed and deployed machine learning as a service on Microsoft Azure cloud service.

Created a framework for establishing credit scores, covering both operational and baseline models.

Developed dashboard to visualize risk of various portfolio scenarios.

Developed dashboard to visualize the credit life cycle for retail unsecured loans.

Delivered machine learning based models for predicting signs of stress among borrowers to challenge a conventional statistical model.

Information used included structured and semi-structured data elements collected from both internal and external sources.

Led the development of segmentation architecture of an energy sector asset portfolio using statistical models to create pools for estimation of returns

Conceptualized and operationalized audit sampling mechanism for a large bank in risk based

Creation of a target operating model that balances the modelling rigor and product specific priorities and nuances.

Data Scientist

Dec 2013 - Feb 2015

BASF Chemicals USA Florham Park, New Jersey

Applied linear regression in Python and SAS to understand the relationships, including causal relationships, between attributes of the dataset.

Implemented a Python-based distributed random forest via Python streaming.

Applied clustering algorithms (hierarchical, K-means) using scikit-learn and SciPy.

Developed custom predictive algorithms using TensorFlow.

Designed and modeled the reporting data warehouse, considering current and future reporting requirements

Pre-processed and cleaned data for feature engineering and imputed missing values in the dataset using Python.

Built and analyzed datasets using R, SAS, MATLAB and Python.

Extraction and tabulation of data from multiple data sources using R, SAS.

Data cleansing, transformation and creating new variables using R.

Analyzed source data coming from different sources (SQL Server, Oracle, and flat files such as Access and Excel) and worked with business users and developers to develop the model.

Executed SQL queries to validate actual test results and match expected results as per financial rules.

Responsible for maintaining the integrity of the SQL database and reporting any issues to the database architect.

Involved in the daily maintenance of the database that involved monitoring the daily run of the scripts as well as troubleshooting in the event of any errors in the entire process.

Worked extensively with Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK and scikit-learn)

Education

Texas A&M University at College Station, Texas

B.S. Nuclear Engineering, Minor in Mathematics; Major GPA: 3.58/4.0

Ad Hoc Training

Introduction to R for Data Science edX

Data Science: Visualization edX

Data Science: Probability edX

The Complete Oracle SQL Certification Course Udemy

Master SQL for Data Science Udemy

Tableau 10 A-Z: Hands-on Tableau Training for Data Science Udemy

Oracle Database SQL
