Luis Trevino
Data Scientist
Phone: 832-***-****
Email: **************@*****.***
Technical Skills
Data Science
Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Neural Networks, Machine Perception
Programming Languages
Java, Python, R, MATLAB
Frameworks
Spark, Spark Streaming, Kinesis, Kafka, Hive, Pig, Sqoop, MapReduce, Storm, Ambari
Data Analytics Methods
Advanced Data Modeling, Forecasting Models, Regression Analysis, Predictive Analytics, Statistical Analysis, Sentiment Analysis, Stochastic Optimization, Hypothesis Testing, Time Series Analysis, A/B Testing, Clustering
Database/Datastores
HDFS, RDBMS, Data Lake, Data Warehouse, Amazon Redshift, Cassandra, Aurora, SQL, PostgreSQL
Machine Learning
TensorFlow, Keras, Theano, Caffe, Gluon, MXNet, PyTorch, Deeplearning4j, CNTK
Data Query
Azure, Google, Amazon Redshift, Kinesis, EMR; HDFS, RDBMS, data lakes, and various SQL and NoSQL databases and data warehouses.
Project Management
Agile Scrum
Database
Oracle, RDBMS, SQL, NoSQL, Cassandra, Redshift, Aurora, MongoDB
Visualization
R, Tableau, ggplot2, dygraphs
Years of Experience
• 7 years Data Science
• 7 years Data Analysis
• 5 years Software Engineering
Professional Summary
Proficiency in Predictive Modeling, Text mining and Machine Learning
Proficient in applying Statistical Modeling and Machine Learning techniques (Linear, Logistic, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian) in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, PCA, Ensembles
Experience in Text Mining and good knowledge of NLP components such as Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Knowledge of Hadoop Core Components (HDFS, MapReduce) and Hadoop Ecosystem (Sqoop, Hive, Pig)
Experience in Univariate and Multivariate Analysis, model testing, problem analysis, model comparison and validation, ANOVA, Regression Analysis.
Expertise in writing complex SQL queries to obtain filtered data for analysis
Working knowledge of implementing tree-based models such as Boosting and Random Forest.
Experience in using Model Pipelines to automate the tasks and put models into production quickly
Experience in using Tableau, creating dashboards and quality storytelling.
Worked with various Python libraries such as NumPy and SciPy for mathematical calculations, Pandas for data preprocessing/wrangling, Matplotlib and Seaborn for data visualization, Scikit-learn for machine learning and NLTK for NLP.
Developed high-performance algorithms for precision targeting, and tested and implemented these algorithms.
Understanding of data structures and algorithms
Experience in scripting languages (Python, R, Scala) and Spark
Development of Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN)
Ability to initiate and drive projects from conception to completion with minimal guidance and deliver superior performance
Developed statistical models in R and Python using various supervised and unsupervised machine learning algorithms such as Linear Regression and Logistic Regression, Classification, Decision Trees, Gradient Boosting, KNN, Support Vector Machines, Naive Bayes, K-Means Clustering, Neural Networks, Principal Component Analysis and Recommender Systems on structured and unstructured data.
Performed exploratory analysis and pre-processing on collected data and created statistical models to provide conclusions that guide policy-making decisions.
Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions in MySQL.
Data analysis using analytic tools: Anaconda, Jupyter Notebook, R, ggplot2, dplyr, car, MASS, lme4, MATLAB, etc.
Professional Experience
Data Scientist
July 2020 – Present
Verizon Richardson, TX
Verizon is a major telecommunications provider. I helped implement a multi-faceted project. Led and developed a recommendation engine for B2B packages consisting of phone devices, share plans, accessories, and features. Consulted with the onsite team and co-led fraud detection for orders placed in the B2B business domain, detecting fraudulent transactions before orders shipped.
Led meetings with the Verizon team to understand the data model and the data available for each use case
Worked with Oracle Database for EDA
Directed the onshore and offshore (India) teams, keeping them in sync through daily meetings
Flattened densely nested JSON files to extract critical data
Communicated constantly with data engineering team to provide data science team with required data
Worked with diverse and large team of 11 people
Ran data science experiments to propose potential models for use cases
Worked with Power BI for demo presentations
Discussed with clients and project owners on a weekly basis
Reported initial EDA on Jupyter notebooks
Extensive use of Python for code development
Reached goals within tight deadlines
Led development of machine learning models for onsite team
Used Random Forest and Logistic Regression for Fraud Detection
Considered content-based and rule-based models for Project Recommendation
Computed evaluation metrics for the models
Coordinated the handoff of work between the development and production teams
Carefully worked with data ensuring proper security procedures (VPN, PII data protocol)
Used SQL extensively to generate analysis and reports
Used Spark for productionizing models
Implemented version control of code with GitHub
Increased sales of accessories for project recommendation
Reduced detections of fraudulent transactions by 56%
Used A/B testing to evaluate package recommendations from live models
Considered several sampling techniques in the initial phase of the project, when the production environment was limited
Data Scientist
April 2019 – June 2020
Avangrid New Haven, CT
Avangrid is a power and utilities provider. This project aimed to provide three major insights: identify which smart meters were connected to which transformers, since some connections were recorded incorrectly in the GIS database (which specifies the smart meters connected to each transformer); perform short-term load forecasting on the distribution transformers; and perform life deterioration analysis on the transformers to determine which are susceptible to failure and how long until they fail. Neural networks (LSTMs and ANNs) were used for two of the insights, making extensive use of the DeepLearning4j deep learning library. We delivered what our customers had requested in a timely manner.
Worked in a Windows environment and communicated with project on
Wrote the project application code in Scala using IntelliJ
Packaged a Maven project into a JAR file to create our application and load any code dependencies
Worked on a 6-month project under high pressure to make deliverables to our customers by the end of the year
Used Zeppelin notebooks to perform exploratory data analysis and gain a better understanding of our data with descriptive statistics and visualizations
Used long short-term memory (LSTM) RNNs to predict the electrical load of distribution transformers
Used statistical analysis such as Pearson correlation to determine the transformer connections to smart meters in residential homes
Used multi-task deep neural networks to determine the life expectancy of distribution transformers
Used Jira for agile project management to coordinate project sprints and complete team goals in a timely manner
Used Spark for distributed computation to reduce the time needed to train deep neural networks
Used the DeepLearning4j deep learning library to create deep learning models such as LSTMs and Multi-Task Feed Forward Neural Networks
Talked to clients to understand their business needs and verify that our solutions met them
Mentored new data science employees joining the team and provided an appropriate data science workflow to the team
Held daily meetings to give updates on our progress and ensure proper team communication
Used Git Bash for version control and to push code to the client's GitHub repository
Used DeepLearning4j's classes to train the models' weights and verify that they were properly trained with its user interface visualization functionality.
Worked with the spark-submit shell script to manage Spark projects and applications
Used Recurrent Neural Networks to perform time series modeling and forecasting
Used Spark SQL to write queries and extract useful information from the data
Worked in the Spark ecosystem to perform SQL queries, train deep learning models, and use its machine learning library
Deployed the models using the DeepLearning4j model serializer class to save the trained models
Wrote documentation for the data science team explaining in detail our work
Led weekly meetings with the data science team and business clients to discuss our progress
Used Citrix for network authentication and enterprise app access
Data Scientist
Sept 2017 – April 2019
Monsanto St. Louis, MO
Worked with large datasets from big data systems in various data stores such as Hadoop HDFS and Amazon Redshift.
Used Python, R, and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, and Decision Trees for estimations
Used Keras and TensorFlow to develop predictive algorithms.
Handled categorical variables with more than 100 levels and created dummy variables for these successfully.
Utilized Advanced Regression Modelling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
Applied Predictive Modelling, Data Mining Methods, Factor Analysis, ANOVA, and Hypothesis Testing.
Tried Support Vector Machine and K-Nearest Neighbor classification algorithms.
Analyzed and prepared data, identifying patterns in the dataset by applying historical models.
Designed and built a production-ready logistic regression model.
Worked extensively on Feature Engineering and Exploratory Data Analysis.
Used R and Python for Exploratory Data Analysis, ANOVA tests and Hypothesis testing.
Worked through the full software development life cycle (SDLC) using Agile and Scrum methodologies.
Created dashboards that helped market analysts spot emerging trends by comparing historical metrics.
Data Scientist
Feb 2016 - Sept 2017
HSBC New York, NY
Development of targeting, activation, incremental sales and look-alike models to increase profit by recommending financial solutions
Used ROC curves and AUC for feature selection.
Defined, designed, and documented conceptual and logical data models.
Development of clusters using information from prospect database for enabling marketing initiatives.
Performed data profiling to validate data quality for the critical data elements.
Developed clear definitions for data elements.
Promoted data changes and consistent data naming standards across the enterprise.
Prepared source-to-target data mapping documents.
Validated target table structure and constraints against ETL requirements.
Validated target data against source data based on ETL requirements.
Involved in test data preparation.
Tested reports and dashboards against target tables using SQL queries.
Worked with scikit-learn feature tools for data prep, cleansing and feature engineering
Creation of a target operating model that balances the modelling rigor and product specific priorities and nuances.
Development of dashboards and models to support underwriting of various card products for international markets with varied credit bureau maturity.
marketing initiatives for credit card portfolio.
Data mining using state-of-the-art methods.
Enhanced data collection procedures to include information that is relevant for building analytic systems.
Data Scientist
Feb 2015 - Feb 2016
Scotiabank New Orleans, LA
Supervised, Unsupervised, Semi-Supervised classification and clustering of documents
Used key indicators in Python and machine learning concepts like regression, Bootstrap Aggregation and Random Forest.
Developed and deployed machine learning as a service on Microsoft Azure cloud service.
Created credit score establishing framework covering both operational and baseline models.
Developed dashboard to visualize risk of various portfolio scenarios.
Developed dashboard to visualize the credit life cycle for retail unsecured loans.
Delivered machine learning based models for predicting signs of stress among borrowers to challenge a conventional statistical model.
Information used included structured and semi-structured data elements collected from both internal and external sources.
Led the development of segmentation architecture of an energy sector asset portfolio using statistical models to create pools for estimation of returns
Conceptualized and operationalized audit sampling mechanism for a large bank in risk-based sampling using predictive models, enabling improved error detection and cost-efficient sampling.
Creation of a target operating model that balances the modelling rigor and product specific priorities and nuances.
Data Scientist
Dec 2013 - Feb 2015
BASF Chemicals USA Florham Park, New Jersey
Applied linear regression in Python and SAS to understand the relationships, including causal relationships, between different attributes of the dataset.
Implemented a Python-based distributed random forest via Python streaming.
Applied clustering algorithms (Hierarchical, K-means) with the help of scikit-learn and SciPy.
Developed custom predictive algorithms using TensorFlow.
Designed and modeled the reporting data warehouse considering current and future reporting requirements
Pre-processed and cleaned data for feature engineering and applied data imputation techniques for missing values in the dataset using Python.
Built and analyzed datasets using R, SAS, MATLAB and Python.
Extraction and tabulation of data from multiple data sources using R, SAS.
Data cleansing, transformation and creating new variables using R.
Analyzed the source data coming from different sources (SQL Server, Oracle, and flat files such as Access and Excel) and worked with business users and developers to develop the model.
Executed SQL queries to validate actual test results and match expected results as per financial rules.
Responsible for maintaining the integrity of the SQL database and reporting any issues to the database architect.
Involved in the daily maintenance of the database, including monitoring the daily run of the scripts and troubleshooting any errors in the process.
Extensively worked on Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK and scikit-learn)
Education
Texas A&M University at College Station, Texas
B.S. Nuclear Engineering, Minor in Mathematics; Major GPA: 3.58/4.0
Ad Hoc Training
Introduction to R for Data Science edX
Data Science: Visualization edX
Data Science: Probability edX
The Complete Oracle SQL Certification Course Udemy
Master SQL for Data Science Udemy
Tableau 10 A-Z: Hands-on Tableau Training for Data Science Udemy
Oracle Database SQL