Resume

Sr. Data Scientist

Location:

Ewing, NJ

Posted:

December 08, 2022

Fernando Penaloza

Sr. Data Scientist

●Email: adti6f@r.postjobfree.com

●Phone: 510-***-****

Professional Summary

•Senior Data Scientist with over 9 years of experience working in the Data Science and Machine Learning field.

•Authored 15+ publications and 1 patent in the fields of bioinformatics and deep learning.

•Hands-on experience in applying machine learning techniques such as Naïve Bayes, Linear and Logistic Regression, Neural Networks, RNN, CNN, Transfer Learning, Time-Series Analysis, Decision Trees, and Random Forests.

•Skilled in performing exploratory analysis on varied types of data and datasets to build a thorough knowledge of the subject matter, an understanding of the variables in question, and a technically sound view of the required modeling approach.

•Successfully developed logical data architectures and ensured adherence to enterprise architectural schemas.

•Proven analytical skills and applications of Bayesian Analysis, Inference, Time-Series Analysis, Regression Analysis, Linear models, Multivariate analysis, Gradient Descent, Sampling methods, Forecasting, Segmentation, Clustering, Sentiment Analysis, and Predictive Analytics.

•Familiar with Azure, Google, MySQL, Microsoft SQL Server, PostgreSQL, SQLite, Data Warehouses, and Data Lakes. Optimized SQL performance, integrity, and security of project databases and schemas.

•Skilled in building machine learning models using Logistic Regression, Random Forest, XGBoost, KNN, SVM, Neural Networks, Linear Regression, Lasso Regression, and K-Means.

•Well-versed in designing and presenting interactive data visualizations and widgets in Python using Matplotlib, Plotly, and Seaborn, and in R using ggplot2, the Tidyverse, and R Shiny.

•Experienced in producing Custom BI reporting dashboards in Python using Dash with Plotly for rapid dissemination of actionable, data-driven insights.

•Understanding of a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction in Python.

•Developed, deployed, and maintained production NLP models with scalability in mind.

•Applied Natural Language Processing with NLTK, spaCy, and other modules to develop applications for automated customer response.

•Skilled in using Python and R to develop neural networks and perform cluster analysis.

•Demonstrated skill with Natural Language (NLP, NLG), Machine Learning, Deep Learning, AI, IoT, Predictive Analytics, and Neural Nets.

•Hands-on implementation of Clustering, Neural Networks, LDA (Linear Discriminant Analysis), Naïve Bayes, Random Forests, Decision Trees, Linear Regression, Logistic Regression, SVM, Principal Component Analysis, Multidimensional scaling (MDS), and Recommender Systems.

•Skilled in applying dplyr (R) and Pandas (Python) for exploratory data analysis.

Technical Skills

•Python Packages: Pandas, TensorFlow, NumPy, Scikit-learn, PyTorch, SciPy, Matplotlib, Seaborn, re, NLTK, gensim, spaCy.

•R Packages: tidyverse (readr, DBI, tidyr, lubridate, dplyr, tibble, purrr, stringr, glue, tidymodels, ggplot2), caret, shiny, bioconductor, knitr, and RMarkdown.

•Deep Learning: Machine Learning algorithms, Neural Networks, Machine Perception, Data Mining, RNN, CNN, Transfer learning, TensorFlow, Keras, PyTorch.

•Analysis Methods: Predictive Analytics, Decision Analytics, Advanced Data Modeling, Bayesian Analysis, Statistical, Exploratory, Inference, Regression Analysis, Multivariate analysis, Sampling methods, Forecasting, Segmentation, Clustering, Sentiment Analysis, Design and Analysis of Experiments, Factorial Design and Response Surface Methodologies, Optimization, and State-Space Analysis.

•Analysis Techniques: TensorFlow, PCA, RNN including LSTM, CNN, Transfer learning, Random Forest, Classification and Regression Trees (CART), Gradient Boosting Machine (GBM), Linear and Logistic Regression, Naïve Bayes, Simplex, Markov Models, and Jackson Networks.

•Data Query: Azure, Google, MySQL, Microsoft SQL Server, PostgreSQL, SQLite, Data Warehouse, Data Lake.

•Machine Learning: Natural Language Processing and Understanding, Machine Learning algorithms (text recognition, image classification, and forecasting).

•Data Modeling: Predictive Modeling, Stochastic Modeling, Linear Modeling, Behavioral Modeling, Bayesian Analysis, Statistical Inference, Probabilistic Modeling, and Time-Series Analysis.

•Artificial Intelligence: Text Understanding, Classification, Pattern Recognition, Recommendation Systems, Targeting Systems, Ranking Systems, and Time Series Analyses.

•Applied Data Science: Natural Language Processing, Machine Learning, Text Recognition, Image Classification, Social Analytics, Predictive Maintenance.

•Analytic Development: Python, R Programming, SQL, Excel.

•IDE: Jupyter, Spyder, RStudio, Google Colab.

•Version Control & PM tools: Git, GitHub, Jira, Kanban.

Professional Experience

Senior Data Scientist (Team Lead)

J&J - Titusville, NJ

May 2022 - Present

Environment: AWS, Dataiku

Technologies: EKS, EMR, Dataiku, JIRA, Confluence, Bitbucket, PyCharm, sklearn, XGBoost.

Project Summary: Data Science lead for Tremfya PsA and PsO, implementing Machine Learning algorithms as part of the Commercial Excellence team. Coordinated the requirements of the Advanced Analytics, Sales Operations, Data Engineering, and Infrastructure teams to achieve the project deliverables.

Managed a team of data scientists, machine learning engineers, and big data specialists

Developed machine learning models (XGBoost regression and classification) for prescriptive analytics to increase the ROI of 2 brands, Tremfya PsA and PsO.

Tested and deployed machine learning models using Dataiku and the AWS infrastructure. Coordinated and translated the requirements of the business partners to data science solutions.

Led data mining and collection procedures to ensure data quality and integrity

Interpreted and analyzed business problems, then conceived, planned, and implemented data projects

Built analytic systems and predictive models and tested the performance of data-driven products

Ensured alignment with key technology and business stakeholders across globally diverse, Agile teams.

Designed and presented interactive data visualizations and widgets in Python using Matplotlib, Plotly, and Seaborn, and in R using ggplot2, the Tidyverse, and R Shiny.

Created reports delivering key business insights

Experimented with new models and techniques

Recommended best practices for using Dataiku resources and implemented an optimization that reduced resource utilization.

Coordinated knowledge transfer (KT) sessions to introduce new team members to the technologies used on the project.

Senior Data Scientist & Machine Learning Engineer

Kaiser Permanente - Oakland, CA (REMOTE)

June 2020 – Apr 2022

The initial focus of the team was to build a Computer Vision model to identify and predict possible cancer remission based on patients’ medical information. However, because of the COVID-19 pandemic, the team’s focus was redirected to building a classifier capable of separating cases of COVID-19 from other respiratory issues. To accomplish this, the team combined CT scan images with demographic data, achieving a novel 80% accuracy in the early stages of the pandemic. Part of the work involved visual diagnosis of lung diseases using Computer Vision, among other ML techniques. Additionally, the data team worked on other ML use cases for NLP.

Produced cancer and COVID-19 diagnoses based on patient demographics, lung data, and tumor size, shape, and location data.

Medical professionals used the predictions to recommend and optimize patient treatment plans.

Employed PyTorch, Scikit-Learn, and XGBoost libraries to build and evaluate the performance of different models.

Pulled data from an internal Hadoop Data Lake (AWS).

Analyzed CT scan image data using Convolutional Neural Networks (CNNs).

Used common NLP techniques such as pre-processing (tokenization, part-of-speech tagging, parsing, stemming).

Datasets consisted of a split of COVID and non-COVID PCR amplification graphs (10K training, 100K testing).

Designed and implemented a web-based LIMS (Laboratory Information Management System) platform focused on the processing of SARS-CoV-2 samples, which resulted in a 30% increase in productivity.

Implemented a Machine Learning algorithm with 99.99% accuracy to classify SARS-CoV-2 (COVID-19) PCR results.

Supported the design and statistical analysis of multiple research projects.

Hands-on with AWS Sagemaker, TensorFlow, Keras, and PyTorch Deep Learning tools.

Ensured alignment with key technology and business stakeholders across globally diverse, Agile teams.

Contributed to the analysis, setup, and installation of continuous integration and continuous delivery (CI/CD) processes.

Ensured that all MLOps processes required for CI/CD were documented and followed by the development team.

Genetics & Research Data Scientist

MD Anderson Cancer Center (University of Texas) - Houston, TX

November 2018 - June 2020

In collaboration with the Mexico National Cancer Institute, I designed and implemented multiple pipelines for processing Next Generation Sequencing (NGS) data such as WGS, Exome-Seq, and RNA-Seq to aid in the identification of actionable cancer-related mutations. Supported the design and statistical analysis of multiple research projects. Developed Machine Learning methods for cancer diagnosis. For this project, a diagnostic method based on more easily obtained metrics was developed using data from 10,000 patients. Features included several biometrics. Missing values were imputed using linkage disequilibrium and haplotype information. A correlation matrix was used for feature selection. Oversampling and other feature engineering techniques were applied to the positive minority class. Several models were tested before selecting the best-performing ones (ANNs).

Applied scientific and business analytics skills, integrated and prepared large, varied datasets, and communicated results.

Worked with specialized database architecture and cloud computing environments.

Developed analytic approaches to strategic business and clinical decisions.

Worked with Azure, Google, MySQL, Microsoft SQL Server, PostgreSQL, SQLite, Data Warehouses, and Data Lakes.

Optimized SQL performance, integrity, and security of the project’s databases/schemas

Performed analysis using predictive modeling, data/text mining, and different statistical tools.

Designed and presented collaborative data visualizations and widgets in Python using Matplotlib, Plotly, and Seaborn, and in R using ggplot2, the Tidyverse, and R Shiny.

Built predictive models using Machine Learning algorithms such as Random Forests, Naive Bayes, Neural Networks, SVM, NLP techniques, Ensemble Modeling, and Gradient Boosting (GB).

Worked with Big Data infrastructure and tools such as Hive and Spark.

Applied statistics and organized large datasets of both structured and unstructured data.

Worked with applied statistics and applied mathematics tools for performance optimization.

Data Scientist & Machine Learning Engineer

Home Depot - Atlanta, GA

January 2016 – November 2018

As a Data Scientist in Home Depot’s Decision Science Division, I was assigned to help predict customer profitability for different areas while optimizing marketing campaign costs. The challenge was to improve the model’s score in order to optimize marketing spending per customer and channel. The solution involved customer segmentation and different ensemble techniques to make accurate predictions. Several recommender systems were applied for different channels. The team followed an Agile methodology with daily standups and bi-weekly presentations. I served as the primary modeler, responsible for model development and deployment for operations and marketing activities.

Clustered customers and operations with K-means, hierarchical clustering, and others.

Fed the cluster assignments into an XGBoost regression model to make the final prediction and set target price goals based on the statistical likelihood of achieving them given each customer’s past behavior.

Applied Kalman filters to time-series models for customer spending predictions.

Built the model on an Azure Notebook platform.

Programmed model functions using Python.

Fit several preliminary Bayesian and machine learning models in R and Python (with PySpark for data retrieval in Python) for improved understanding of data and feature selection.

Applied gradient-boosted decision tree algorithms such as XGBoost, CatBoost, and LightGBM.

Ran regression experiments using KNN regression.

Applied Long Short-Term Memory (LSTM) and other Recurrent Neural Network (RNN) architectures in deep learning models.

Managed SQL Server databases through multiple product lifecycle environments, from development to mission-critical production systems

Created, deployed, and ran containerized applications using Docker on Kubernetes clusters.

Implemented MLOps processes involving CI/CD pipelines

Well-versed in Azure cloud tools such as:

o AzurePing

o Cloud Explorer for Visual Studio

o Cloud Combine

o SQL Database Migration Wizard

o Azure Blob Studio

o Microsoft Azure Storage Connected Service

o Graph Engine VS Extension

o Docker

Data Scientist - Bioinformatics

Leibniz Institute for Zoo and Wildlife Research (Leibniz-Institut für Zoo- und Wildtierforschung) - Berlin, GERMANY

August 2014 – January 2016

The Leibniz-IZW is an internationally renowned German research institute. I was part of the Research Biostatistician team for the development of the latest ML techniques for wildlife conservation. Utilized Time Series, XGBoost, DNN, etc. Co-author of 3 scientific publications.

Performed data analysis of Next Generation Sequencing data with a focus on phylogenomics and conservation biology.

Designed and presented interactive data visualizations and widgets in Python using Matplotlib, Plotly, and Seaborn, and in R using ggplot2, the Tidyverse, and R Shiny.

Used Python to retrieve and clean the data before model implementation and training.

Wrote functions to perform pre-processing to impute the missing values using a linear interpolation technique.

Normalized features in the data to reduce noise and improve the signal-to-noise ratio.

Managed SQL Server databases through multiple product lifecycle environments, from development to mission-critical production systems

Handled feature reduction using Principal Component Analysis (PCA) to reduce the dimensionality of the data.

Used Multiple Linear Regression for different use cases

Standardized ETL processes and tools to automate data extraction and manual reporting, thereby reducing man hours across the entire operation

Research Intern (Data analytics & Bioinformatics)

Natural History Museum of Denmark - Copenhagen, DENMARK

December 2012 – August 2014

Contributed to several research projects, co-authoring 5 scientific publications. Performed data analysis of Next Generation Sequencing data with a focus on phylogenomics, evolutionary genomics, and ancient DNA.

Used ML techniques to predict evolutionary relationships between different animal and plant species collected from historical datasets for research.

Produced comprehensive reports and documentation written in LaTeX and presented them to technical and non-technical panels.

Presented my results at The Society for Molecular Biology and Evolution (SMBE) annual conference in San Juan, Puerto Rico (2014).

Designed interactive data visualizations and widgets in R using Tidyverse and R Shiny for visualization.

Managed SQL Server databases through multiple product lifecycle environments, from development to mission-critical production systems

Education

Ph.D. in Biomedical Sciences (Doctoral Candidate) - UNAM

•Genomics

•Statistical and Deep learning analysis

B.S. in Genomic Sciences – UNAM

•Copenhagen University/Denmark

DataCamp Profile

https://www.datacamp.com/profile/fpenaloz

Certifications

Data Scientist - Professional Certificate - DataCamp

Data Analyst - Professional Certificate - DataCamp

Data Analytics - Professional Certificate - Google

Project Management - Specialization - University of California Irvine

Project Management - Professional Certificate - Google

Machine Learning - Specialization - University of Washington (In Progress)

Scientific Publications

https://scholar.google.com/citations?user=qimJiyMAAAAJ

Comparative performance of two whole genome capture methodologies on ancient DNA Illumina libraries

MC Avila-Arcos, M Sandoval-Velasco, H Schroeder, ML Carpenter, ...

Methods in Ecology and Evolution.

The limits and potential of paleogenomic techniques for reconstructing grapevine domestication

N Wales, JR Madrigal, E Cappellini, AC Baez, JAS Castruita, ...

Journal of Archaeological Science 72, 57-70

An expanded mammal mitogenome dataset from Southeast Asia

F Mohd Salleh, J Ramos-Madrigal, F Peñaloza, S Liu, SS Mikkel-Holger, ...

GigaScience 6 (8), gix053

Genome Evolution in Three Species of Cactophilic Drosophila

A Sanchez-Flores, F Peñaloza, J Carpinteyro-Ponce, N Nazario-Yepiz, ...

G3: Genes, Genomes, Genetics 6 (10), 3097-3105

A draft genome sequence of the elusive giant squid, Architeuthis dux

RR Da Fonseca, A Couto, AM Machado, B Brejova, CB Albertin, F Silva, ...

GigaScience 9 (1), giz152

Saliva is a reliable and accessible source for the detection of SARS-CoV-2

LA Herrera, A Hidalgo-Miranda, N Reynoso-Noverón, ...

International Journal of Infectious Diseases 105, 83-90

Evolutionary history and conservation significance of the Javan leopard Panthera pardus melas

A Wilting, R Patel, H Pfestorf, C Kern, K Sultan, A Ario, F Peñaloza, ...

Journal of Zoology 299 (4), 239-250

The evolutionary landscape of SARS-CoV-2 variant B.1.1.519 and its clinical impact in Mexico City

A Cedro-Tanda, L Gómez-Romero, N Alcaraz, G de Anda-Jauregui, ...

Viruses 13 (11), 2182

Whole-genome sequence analysis of multidrug-resistant uropathogenic strains of Escherichia coli from Mexico

GL Paniagua-Contreras, E Monroy-Pérez, CE Díaz-Velásquez, ...

Infection and Drug Resistance 12, 2363

Population genomic footprints of environmental pollution pressure in natural populations of the Mediterranean mussel

AM Ribeiro, CA Canchaya, F Penaloza, J Galindo, RR da Fonseca

Marine Genomics 45, 11-15

Analytical performances of the COVISTIX and Panbio antigen rapid tests for SARS-CoV-2 detection in an unselected population

F Garcia-Cardenas, A Franco, R Cortes, J Bertin, R Valdez, F Penaloza, ...

medRxiv

Complete mitochondrial genomes of the Laotian Rock Rat (Laonastes aenigmamus) confirm deep divergence within the species

M Le, F Penaloza, R Martins, TV Nguyen, HM Nguyen, DX Nguyen, ...

Mitochondrial DNA Part B 1 (1), 479-482
