
Senior Data Scientist

Location:
West Sacramento, CA, 95691
Posted:
December 16, 2024


Anny Carella

Data Scientist

Phone: 279-***-****

Email: ***********@*****.***

With over 10 years in data science and machine learning and 12+ years in information technology, I am a Data Scientist specializing in machine learning solutions, with a strong foundation in project management, leadership, and financial analysis. I am proficient in advanced ML techniques such as linear and logistic regression, decision trees, and neural networks, and experienced in deploying models on cloud platforms such as AWS and Azure. Drawing on a math and physics background, I apply strong analytical skills to complex challenges; I thrive under pressure, learn quickly, and solve problems proactively.

Summary

•Proficient in Naïve Bayes, Regression, Classification, Neural Networks, Deep Neural Networks, Decision Trees, and Random Forest models.

•Skilled in statistical modeling for large datasets using cloud platforms like AWS, Azure, and GCP.

•Expert in applying statistical and predictive modeling to develop systems for real-time analytics and decision support.

•Adept at creating data-driven solutions for business challenges through statistical analysis, innovative thinking, and predictive modeling.

•Experienced in Exploratory Data Analysis (EDA) to uncover patterns and visualize insights using tools like Matplotlib, Seaborn, and Plotly.

•Proven leader with experience guiding teams to develop machine learning models, APIs, and data pipelines to support business strategy.

•Well-versed in supervised and unsupervised learning techniques for varied analytical applications.

•Experienced in predictive analytics for sales forecasting, employing methods such as ARIMA, ETS, and Prophet for enhanced decision-making.

•Strong communicator with the ability to simplify complex models for team members and stakeholders.

•Proficient in designing, building, and maintaining machine learning pipelines and refreshing models as needed.

•Experienced in applying machine learning and statistical techniques to live data streams from big data sources, utilizing PySpark and batch processing for scalable analysis.
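
By way of illustration, a minimal PySpark batch job of the kind referenced in the last bullet might look like the sketch below; the paths and column names are hypothetical placeholders, not from an actual engagement.

```python
# Minimal PySpark batch sketch: aggregate event-level data for downstream
# modeling. Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-feature-aggregation").getOrCreate()

events = spark.read.option("header", True).csv("s3://bucket/events/*.csv")

daily = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("amount").alias("total_amount"),
    )
)

daily.write.mode("overwrite").parquet("s3://bucket/features/daily/")
```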

Technical Skills

Libraries: NumPy, SciPy, Pandas, Theano, Caffe, scikit-learn, Matplotlib, Seaborn, Plotly, TensorFlow, Keras, NLTK, PyTorch, Gensim, urllib, BeautifulSoup4, PySpark, PyMySQL, SQLAlchemy, MongoDB, sqlite3, Flask, Deeplearning4j, EJML, dplyr, ggplot2, reshape2, tidyr, purrr, readr, Apache Spark.

Artificial Intelligence: Classification and Regression Trees (CART), Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, Regression, Naïve Bayes

Machine Learning Techniques: Supervised Machine Learning Algorithms (Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Naïve Bayes Classifiers, K-Nearest Neighbors), Unsupervised Machine Learning Algorithms (K-Means Clustering, Gaussian Mixtures, Hidden Markov Models, Autoencoders), Imbalanced Learning (SMOTE, ADASYN, NearMiss), Deep Learning (Artificial Neural Networks), Machine Perception

Analytics: Data Analysis, Data Mining, Data Visualization, Statistical Analysis, Multivariate Analysis, Stochastic Optimization, Linear Regression, ANOVA, Hypothesis Testing, Forecasting, ARIMA, Sentiment Analysis, Predictive Analysis, Pattern Recognition, Classification, Behavioral Modeling

Natural Language Processing: Document Tokenization, Token Embedding, Word Models, Word2Vec, fastText, Bag of Words, TF-IDF, BERT, ELMo, LDA

Programming Languages: Python, R, SQL, Java, MATLAB, and Mathematica

Applications: Machine Reading Comprehension, Sentiment Analysis, Predictive Maintenance, Demand Forecasting, Fraud Detection, Client Segmentation, Marketing Analysis, Cloud Analytics on cloud-based platforms (AWS, MS Azure, Google Cloud Platform)

Deployment: Continuous improvement of project processes, workflows, and automation, with ongoing learning and development

Development: Git, GitHub, GitLab, Bitbucket, SVN, Mercurial, Trello, PyCharm, IntelliJ, Visual Studio, Sublime, JIRA, TFS, Linux

Big Data and Cloud Tools: HDFS, Spark, Google Cloud Platform, MS Azure Cloud, SQL, NoSQL, Data Warehouse, Data Lake, HiveQL, AWS (Redshift, Kinesis, EMR, EC2, Lambda)

Professional Experience

Data Scientist / GenAI Specialist

Novartis Pharmaceuticals, Sacramento, CA Mar 2024 – Present

As a Data Scientist/ GenAI Specialist, I designed and implemented an API for seamless access to Large Language Models (LLMs) and vector databases, facilitating rapid retrieval of healthcare insights. Leveraging technologies such as LangChain, Hugging Face, Transformers, and OpenAI, I delivered key functionalities to support data-driven decision-making in healthcare. Collaborating closely with cross-functional teams and utilizing JIRA for agile project management, I ensured that the chatbot addressed Novartis’s specific industry needs. By integrating various medical and scientific data sources and fine-tuning the model, I significantly enhanced the chatbot’s accuracy and efficiency, elevating the quality of insights and decision support available to healthcare professionals.
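
A minimal sketch of the embed-and-retrieve pattern behind such an API appears below; the model name, sample snippets, and in-memory cosine search are illustrative stand-ins for a production vector database.

```python
# Sketch: embed text with the OpenAI API and retrieve the nearest snippet
# by cosine similarity. Snippets and model name are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippets = [
    "Drug A is contraindicated with anticoagulants.",
    "Trial NCT-0000 reported improved outcomes in cohort B.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus = embed(snippets)

def top_match(query: str) -> str:
    q = embed([query])[0]
    sims = corpus @ q / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(q))
    return snippets[int(np.argmax(sims))]

print(top_match("Which drugs interact with blood thinners?"))
```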

Model Development and Deployment:

•Designed, developed, and deployed generative AI models using frameworks such as TensorFlow and PyTorch.

•Implemented continuous integration and continuous deployment (CI/CD) pipelines to streamline model iteration and updates (see the evaluation-gate sketch after this list).

•Utilized cloud platforms like Azure Kubernetes Service (AKS) or Azure Machine Learning for scalable, real-time inference.

•Employed Azure Monitor and Log Analytics for tracking model performance and ensuring high availability.
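
As a concrete illustration of the CI/CD bullet above, a pipeline can gate model promotion with a scripted evaluation step; the toy model, synthetic data, and accuracy floor below are assumptions made for the sketch.

```python
# Sketch of a CI/CD evaluation gate: train/evaluate a toy Keras model and
# fail the pipeline if accuracy falls below a (hypothetical) floor.
import sys
import numpy as np
import tensorflow as tf

ACCURACY_FLOOR = 0.90  # hypothetical promotion threshold

def main() -> None:
    # Stand-in holdout set; a real pipeline would load versioned eval data.
    x_val = np.random.rand(256, 8).astype("float32")
    y_val = (x_val.sum(axis=1) > 4.0).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_val, y_val, epochs=5, verbose=0)

    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc < ACCURACY_FLOOR:
        sys.exit(f"model accuracy {acc:.3f} below floor {ACCURACY_FLOOR}")
    model.save("candidate_model.keras")  # artifact for the next CD stage

if __name__ == "__main__":
    main()
```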

Data Engineering and Pipelines:

•Designed and implemented scalable data pipelines with Azure services like Azure Data Factory, Azure Databricks, and Azure Synapse Analytics to process, store, and analyze large datasets.

•Utilized Azure Blob Storage for cost-effective storage of extensive healthcare datasets, and Azure SQL for structured data management.

•Applied security best practices, leveraging Azure Key Vault for safeguarding credentials and secrets while ensuring data storage and processing compliance with relevant regulations (e.g., HIPAA).
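
Retrieving a credential from Azure Key Vault, per the last bullet, typically reduces to a few lines; the vault URL and secret name below are placeholders.

```python
# Sketch: fetch a secret from Azure Key Vault instead of hard-coding it.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # managed identity or CLI login
client = SecretClient(
    vault_url="https://example-vault.vault.azure.net/",  # placeholder
    credential=credential,
)

db_password = client.get_secret("sql-connection-password").value
```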

Natural Language Processing and Machine Learning:

•Developed machine learning models using Python libraries like NumPy, pandas, Scikit-learn, and TensorFlow.

•Implemented workflows for data preprocessing, feature engineering, and model evaluation in Python to create robust and effective models for healthcare applications.

•Utilized Python for interactions with large language models (LLMs), handling natural language processing (NLP) tasks and building custom workflows to support healthcare decision-making.

•Leveraged Langchain to streamline the development of LLM-powered applications, combining components like document loaders, vector stores, and retrievers to create efficient query pipelines that retrieve relevant healthcare insights.

•Integrated Langchain with healthcare databases, external APIs, and cloud services, enhancing the capabilities of LLM-based chatbots to deliver context-aware, domain-specific responses to healthcare challenges.

•Developed modular applications using Langchain’s chains, agents, and tools, creating workflows for targeted healthcare problems such as disease diagnosis or drug discovery, and optimizing the chain for real-time medical support.

•Utilized Hugging Face’s pretrained models, such as BERT, GPT, and RoBERTa, to tackle NLP problems specific to healthcare, fine-tuning these models with healthcare data to improve their understanding of complex queries and expert advice.

•Applied Hugging Face’s Transformer models and NLP pipelines for tasks like text classification, sentiment analysis, and entity recognition in healthcare contexts, supporting decisions based on patient feedback, research papers, and clinical data (see the pipeline sketch after this list).

•Trained custom models using Hugging Face’s Trainer API and Datasets library, integrating specialized healthcare data to enhance the accuracy of diagnostic tools, treatment recommendations, and drug development.

•Fine-tuned OpenAI’s GPT models (e.g., GPT-4) to address healthcare use cases, tailoring responses to specific medical challenges like personalized treatment plans, patient education, and clinical research.

•Used OpenAI models to generate embeddings from medical documents or queries, enabling efficient similarity search and retrieval with vector databases—especially valuable for systems offering rapid medical advice from historical or research data.

•Developed intelligent chatbots with OpenAI’s APIs to deliver real-time healthcare advice, integrating them with Langchain and Azure services to offer healthcare professionals comprehensive, context-aware solutions, enhancing efficiency and informed decision-making at scale.
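
As a minimal illustration of the sentiment-analysis work noted above, a Hugging Face pipeline can score text out of the box; the example sentence is invented, and a fine-tuned clinical model would be passed via the model argument.

```python
# Sketch: off-the-shelf sentiment scoring with a Hugging Face pipeline.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default model
print(sentiment("The updated dosage guidance was clear and easy to follow."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```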

Ethical Considerations and Deployment:

•Ensured adherence to ethical standards and compliance with data privacy regulations (e.g., HIPAA) when handling sensitive information or generating content.

•Deployed generative models in production environments and maintained them for performance and scalability, including optimizing inference times and resource usage.

•Presented findings, insights, and recommendations to stakeholders, explaining complex concepts in an understandable manner.

•Participated in continuous learning and development activities, attending workshops and conferences to enhance knowledge and skills in generative AI.

•Conducted exploratory data analysis and experiments to test hypotheses and uncover new opportunities for generative applications.

Lead GenAI / Data Scientist

Mid-American Energy, Chicago, IL Jun 2023 – Feb 2024

As Lead GenAI/ Data Scientist at Mid-American Energy, I developed regression models on Azure Cloud and Databricks, integrating essential business factors such as marketing expenditure, macroeconomic trends, seasonality, and competition. By optimizing resource allocation, I improved ROI and integrated models with existing systems through Python APIs. Additionally, I built machine learning models to predict Customer Lifetime Value (CLV) and churn rates, providing actionable insights for customer retention strategies. My role also involved leveraging cloud platforms like AWS, Google Cloud, and Azure to ensure scalable and efficient data handling and model deployment.
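
A simplified sketch of such a driver-based regression is shown below; the data is synthetic and the coefficients illustrative, not the production model.

```python
# Sketch: demand regression on synthetic data with the drivers named above
# (marketing spend, macro indicator, seasonality, competitor activity).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "marketing_spend": rng.uniform(10, 100, n),
    "macro_index": rng.normal(100, 5, n),
    "month": rng.integers(1, 13, n),
    "competitor_price": rng.uniform(20, 40, n),
})
# Encode seasonality cyclically so December sits next to January.
df["season_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["season_cos"] = np.cos(2 * np.pi * df["month"] / 12)

y = (0.8 * df["marketing_spend"] + 0.5 * df["macro_index"]
     + 5 * df["season_sin"] - 0.6 * df["competitor_price"]
     + rng.normal(0, 5, n))

features = ["marketing_spend", "macro_index", "season_sin", "season_cos",
            "competitor_price"]
model = LinearRegression().fit(df[features], y)
print(dict(zip(features, model.coef_.round(2))))
```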

•Developed regression models on Azure Cloud and Databricks, incorporating factors such as marketing expenditures, macroeconomic indicators, seasonality, and competitive influences.

•Enhanced marketing strategies by integrating models into existing systems through Python-based APIs.

•Optimized resource allocation to maximize ROI and ensured models were maintained in a scalable cloud environment.

•Regularly updated models to reflect changing market trends, leveraging continuous integration and deployment strategies on Azure.

•Documented model architectures, training processes, and evaluation results to ensure reproducibility and facilitate knowledge transfer within the team.

•Ensured adherence to ethical standards and compliance with data privacy regulations when handling sensitive information or generating content.

•Deployed generative models in production environments and maintained them for performance and scalability, including optimizing inference times and resource usage.

•Presented findings, insights, and recommendations to stakeholders, explaining complex concepts in an understandable manner.

•Participated in continuous learning and development activities, attending workshops and conferences to enhance knowledge and skills in generative AI.

•Conducted exploratory data analysis and experiments to test hypotheses and uncover new opportunities for generative applications.

•Utilized PowerBI, Snowflake, Databricks, and Azure Cloud for comprehensive data analysis and visualization.

•Worked within the market data division, focusing on segmentation and forecasting to deliver insights that supported customer loyalty efforts for member companies.

•Built machine learning models to predict Customer Lifetime Value (CLV) and churn rates, utilizing survival analysis techniques (see the survival-analysis sketch after this list).

•Analyzed historical transactional data to construct predictive models and applied unsupervised learning to segment customers by behavior and characteristics.

•Leveraged Python, R, SQL, and distributed data processing tools such as Apache Hadoop and Spark.

•Used AWS, Google Cloud, and Azure platforms for efficient data handling, model training, and deployment.

•Developed regression models for CLV, classification models for churn prediction, and clustering algorithms for customer segmentation.

•Improved model performance through feature engineering and tuning, assessing effectiveness with metrics like accuracy, precision, recall, F1 score, RMSE, and silhouette score.

•Designed, developed, and implemented generative AI models using frameworks such as TensorFlow and PyTorch.

•Gathered and preprocessed large datasets to train generative models, ensuring data quality and relevance.

•Stayed current on the latest advancements in generative modeling techniques and incorporated state-of-the-art algorithms into projects.

•Conducted experiments to evaluate model performance, including metrics like accuracy, loss, and generative quality, and refined models based on results.

•Worked with cross-functional teams, including data scientists, engineers, and product managers, to understand requirements and develop AI solutions that met business needs.

•Created applications and tools that leveraged generative AI models for practical use cases, such as content generation, image synthesis, and predictive analytics.
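
The survival-analysis sketch referenced above might look like the following; the lifelines package is an assumed tooling choice (the resume names the technique, not the library), and the tenure data is synthetic.

```python
# Sketch: Kaplan-Meier estimate of customer retention on synthetic data.
# Durations are months of tenure; `churned` flags whether churn was
# observed (1) or the customer is still active / censored (0).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
durations = rng.exponential(scale=24, size=500).round()  # months
churned = rng.binomial(1, 0.7, size=500)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=churned)

print(kmf.median_survival_time_)      # median months before churn
print(kmf.survival_function_.head())  # P(customer survives past t)
```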

Lead Data Scientist / Machine Learning Engineer

US Cellular, Chicago, IL Sep 2021 – May 2023

At US Cellular, I developed a custom CNN model in Python using TensorFlow and Keras, achieving 93% accuracy in object recognition and 90% in document processing by leveraging multimodal fusion with an NLP module. This project involved extensive data preprocessing on over 10,000 images and integrated OCR tools such as Tesseract and Amazon Textract to classify complex visual and text data. The solution streamlined image and document classification, reducing false positives by 40% for enhanced operational efficiency.
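
The core of such a CNN can be sketched in Keras as follows; the input shape, layer widths, and class count are illustrative rather than the production architecture.

```python
# Sketch: a small image-classification CNN in TensorFlow/Keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),           # illustrative input size
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # example class count
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```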

•Developed, evaluated, and trained a custom convolutional neural network (CNN) in Python using frameworks such as TensorFlow and Keras.

•Utilized model checkpoints, early stopping, and optimizers like Adam to accelerate the training process.

•Applied image resizing and interpolation to a standard size, and generated rotational and other invariances with the scikit-image (skimage) library.

•Employed the OpenCV (cv2) library for reading and rendering video files.

•Collected and pre-processed a dataset of over 10,000 images, encompassing objects in diverse environments with various lighting conditions, angles, and scanned documents of differing quality and resolution.

•Trained and fine-tuned a CNN architecture using TensorFlow and Keras, achieving an initial object recognition accuracy of 85%.

•Developed an NLP module using spaCy to analyze textual descriptions of objects and contexts, generating additional features for the CNN.

•Integrated the NLP module with the CNN through a multimodal fusion approach, enabling the model to learn from both visual and textual data simultaneously.

•Applied the object recognition system to scanned documents, using Tesseract OCR and Amazon Textract to classify text, tables, and other relevant content (see the OCR sketch after this list).

•Developed a post-processing module utilizing NLP techniques to analyze and interpret extracted text, generating structured data outputs.

•Evaluated the performance of the NLP-enhanced CNN and document processing system on a holdout set, achieving a final accuracy of 93% in object recognition and 90% in document processing, with a 40% reduction in false positives.
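
The OCR sketch referenced above: extracting raw text from a scanned page with Tesseract via pytesseract. The file path is a placeholder; Amazon Textract would substitute for table-aware extraction.

```python
# Sketch: OCR a scanned page with Tesseract via pytesseract.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text[:500])  # first 500 characters of recognized text
```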

Data Scientist

Nestlé, St. Louis, MO Jan 2019 – Aug 2021

As a Data Scientist at Nestlé, I automated data acquisition, modeling, and visualization processes to improve operational efficiency and streamline workflows. I provided internal data science consulting to help business partners identify and address challenges through data-driven solutions, supporting projects from ideation to delivery. By integrating diverse data inputs—such as shipment, order, and promotional data—I enhanced forecasting models for manufacturing plants, achieving a 10% improvement in accuracy and reducing bias by 5%. Additionally, I utilized advanced techniques like PCA, SVM, and various neural networks, leading to significant enhancements in performance, accuracy, precision, and recall rates.
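
A minimal version of the ARIMA-style forecasting described above is sketched below on a synthetic monthly series; the model order is illustrative, and ARIMAX would add an exogenous driver matrix (e.g., promotions) via the exog argument.

```python
# Sketch: fit an ARIMA model to a synthetic monthly demand series and
# forecast the next quarter. The (1, 1, 1) order is illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=36, freq="MS")
demand = pd.Series(1000 + np.cumsum(rng.normal(5, 20, 36)), index=idx)

fit = ARIMA(demand, order=(1, 1, 1)).fit()
print(fit.forecast(steps=3))  # next three months
```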

•Automated data acquisition, modeling, and visualization processes to enhance efficiency and simplify workflows.

•Offered internal data science consulting to assist business partners in identifying opportunities and challenges that could be addressed with data-driven solutions, supporting small-scale projects from ideation to execution and delivery.

•Integrated various data inputs (shipment, order, POS, and promotional data) from multiple external sources, including Sales, Marketing, and Operations Planning, to identify potential predictors of customer demand.

•Cleaned and transformed data to prepare datasets for subsequent analysis.

•Developed and improved forecasting models for manufacturing plants and customer accounts using diverse statistical techniques such as regression, ARIMAX, and Exponential Smoothing Methods (ESM), resulting in a 10% improvement in forecast accuracy and a 5% reduction in forecast bias.

•Conducted data preprocessing on sensor-generated and IoT data.

•Employed Principal Component Analysis (PCA) and feature elimination to maintain classification accuracy above 99% in trained models.

•Implemented Support Vector Machines (SVM) for faster training and reduced resource consumption.

•Developed various neural network architectures, including convolutional and recurrent networks, to handle large feature sets.

•Implemented K-means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), as well as mixture methods such as the multivariate Gaussian mixture model (see the sketch after this list).

•These techniques delivered measurable gains in model performance, accuracy, precision, and recall.
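
The clustering sketch referenced above, reduced to its essentials: PCA for dimensionality reduction followed by K-means segmentation. The data is synthetic, and the component and cluster counts are illustrative.

```python
# Sketch: PCA + K-means on synthetic sensor-style features, scored with
# the silhouette coefficient.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))

X_reduced = PCA(n_components=5).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(silhouette_score(X_reduced, labels))
```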

Data Scientist

HCSC, Richardson, TX Feb 2017 – Dec 2018

As a Data Scientist with HCSC, I leveraged a robust tech stack including Python, NoSQL, Docker, AWS, and Kubernetes to enhance data analytics and model development. By utilizing libraries such as NumPy, Pandas, and TensorFlow, I implemented advanced machine learning techniques, including BERT-based embeddings and various neural network architectures. My focus on natural language processing allowed me to create custom word embeddings using NLTK and Gensim. Additionally, I deployed operational models via a RESTful API using Flask, all while adhering to Agile methodologies to ensure timely delivery and iterative improvement.
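
A minimal sketch of BERT-based sentence embeddings of the kind described above, using mean pooling over token states; the model choice and pooling strategy are illustrative.

```python
# Sketch: mean-pooled BERT sentence embeddings with Hugging Face.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # mean over real tokens

print(embed(["claims processing delay", "coverage question"]).shape)
```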

•Operated within an environment utilizing Python, NoSQL, Docker, AWS, and Kubernetes.

•Engaged with Python libraries such as NumPy, Pandas, SciPy, Matplotlib, Plotly, and Featuretools for data analysis, cleaning, and feature engineering.

•Utilized NLTK and Gensim for natural language processing tasks like tokenization and for generating custom word embeddings.

•Integrated Python’s TensorFlow library for constructing neural network models.

•Implemented embeddings based on BERT.

•Leveraged a variety of models, including Convolutional Neural Networks, Recurrent Neural Networks, LSTM, and Transformers.

•Deployed operational models to a RESTful API using the Python Flask framework and Docker containers (see the endpoint sketch after this list).

•Adopted Agile methodologies, including Extreme Programming, Test-Driven Development, and Agile Scrum.
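
The endpoint sketch referenced above: a minimal Flask prediction route. The model here is a stub standing in for a serialized estimator that would be loaded at startup inside the Docker container.

```python
# Sketch: a minimal Flask prediction endpoint.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features: list[float]) -> float:
    return sum(features) / len(features)  # stand-in for model.predict(...)

@app.route("/predict", methods=["POST"])
def predict_route():
    payload = request.get_json(force=True)
    return jsonify({"prediction": predict(payload["features"])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```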

Data Scientist

Citizens Bank, Boston, MA June 2014 – Jan 2017

As a Junior Data Scientist at Citizens Bank, I developed and integrated logistic and linear regression models tailored to meet various internal requirements. By identifying data patterns through advanced statistical techniques and algorithms, I proposed innovative solutions that combined business insights with mathematical concepts. I utilized predictive modeling methods like ARIMA and ETS, along with decision trees and random forests, to validate variables in regression models. My findings were presented weekly to stakeholders, ensuring alignment with business needs, and I deployed the final model in a Flask application on AWS, making it accessible via a REST API.

•Developed and integrated logistic and linear regression models, meeting internal requirements for covariance and variable selection criteria.

•Identified patterns in data using algorithms and employed experimental and iterative methods to validate results.

•Demonstrated creative thinking and a strong ability to generate and propose innovative solutions to problems by leveraging business insight, mathematical concepts, data models, and statistical analysis.

•Utilized advanced statistical and predictive modeling techniques to construct, maintain, and enhance real-time decision systems using ARIMA, ETS, and Prophet.

•Employed decision trees and random forests to assess the validity of variables used in regression models, incorporating additional techniques such as bagging and boosting (AdaBoost, XGBoost) to enhance these models (see the feature-importance sketch after this list).

•Presented findings and recommendations to business stakeholders on a weekly basis, incorporating feedback and features in response to the evolving needs of the business within a dynamic social landscape.

•Conducted model evaluation across competing model groups to identify the best candidate for further refinement.

•Deployed the final model in a Flask application on AWS, accessible via a REST API.
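
The feature-importance sketch referenced above: vetting candidate regression variables with a random forest on synthetic data. Variable names and the data-generating process are illustrative.

```python
# Sketch: random-forest feature importances as a screen for candidate
# regression variables; low-importance columns are candidates to drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip([f"x{i}" for i in range(6)], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```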

Data Analyst

HubSpot, Cambridge, MA June 2012 – May 2014

•Collected information from diverse sources, including databases, APIs, and customer relationship management (CRM) systems.

•Cleansed and organized data for analysis, ensuring precision and uniformity.

•Recognized and resolved data quality concerns.

•Applied statistical methods (e.g., descriptive analytics, hypothesis testing, regression techniques) to examine data trends and patterns (see the t-test sketch after this list).

•Created and maintained data models and visual representations (e.g., dashboards, graphs) to effectively communicate results.

•Identified key performance indicators (KPIs) and monitored their performance over time.

•Generated detailed reports offering actionable insights for stakeholders.

•Worked collaboratively with business teams to understand their data requirements and developed customized reporting solutions.

•Converted data insights into actionable recommendations for enhancing business processes and strategies.

•Aided decision-making by delivering data-driven evidence.

•Ensured compliance with data governance standards and regulations (e.g., GDPR, CCPA).

•Safeguarded data privacy and security.
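
The t-test sketch referenced above: a two-sample hypothesis test of the kind used for trend analysis, comparing a KPI between two cohorts. The data is synthetic.

```python
# Sketch: two-sample t-test comparing a KPI across customer cohorts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cohort_a = rng.normal(52, 8, 200)  # e.g. weekly engagement score
cohort_b = rng.normal(50, 8, 200)

t_stat, p_value = stats.ttest_ind(cohort_a, cohort_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p < 0.05 -> significant
```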

Education

Master of Science in Business Analytics (MSBA)

Emory University, Goizueta Business School

Master of Science in Information Technology

Carnegie Mellon University, Carnegie Institute of Technology


