Ghilas Bettar
Contact: 781-***-****; Email: ***************@*****.***
LEAD DATA SCIENTIST / GEN AI ENGINEER
PROFILE SUMMARY
•Lead Data Scientist with 12+ years of experience processing and analyzing data across a variety of industries, and 17+ years overall in IT.
•Leverages mathematical, statistical, and machine learning tools to collaboratively synthesize business insights and drive innovative solutions for productivity, efficiency, and revenue across Healthcare, Banking, Insurance, Research, and Energy.
•Extensive experience with third-party cloud platforms: AWS, Google Cloud, and Azure.
•Working with and querying large data sets from big data stores, including Hadoop data lakes, data warehouses, Amazon Redshift, and NoSQL databases.
•Ensemble techniques, including bagging, boosting, and stacking; Natural Language Processing (NLP) methods, in particular BERT, GPT, and word2vec embeddings, sentiment analysis, Named Entity Recognition, and topic modeling; time series analysis with ARIMA, LSTM, RNN, and Prophet.
•Designs and builds data products covering data security, insider risk management, communication compliance, and data lifecycle management, supported by Copilot for Security innovations.
•Experience in the entire data science project life cycle and actively involved in all the phases, including data extraction, data cleaning, statistical modeling, and data visualization with large data sets of structured and unstructured data.
•Demonstrated excellence in using various packages in Python and R like Pandas, NumPy, SciPy, Matplotlib, Seaborn, TensorFlow, Scikit-Learn, and ggplot2.
•Skilled in statistical analysis programming languages such as R and Python (including Big Data technologies such as Spark, Hadoop, Hive, HDFS, and MapReduce).
•Understanding of applying Naïve Bayes, Regression, and Classification techniques as well as Neural Networks, Deep Neural Networks, Decision Trees, and Random Forests.
•Performing EDA to find patterns in business data and communicate findings to the business using visualization tools such as Matplotlib, Seaborn, and Plotly.
•Experience tracking defects using bug-tracking and version-control tools such as Jira and Git.
•Adept at applying statistical analysis and machine learning techniques to live data streams from big data sources using PySpark and batch processing techniques.
•Leading teams to produce statistical or machine learning models and create APIs or data pipelines for the benefit of business leaders and product managers.
•Strong experience interacting with stakeholders and customers; gathering requirements through interviews, workshops, and existing system documentation or procedures; defining business processes; and identifying and analyzing risks using appropriate templates and analysis tools.
•Good knowledge of creating visualizations, interactive dashboards, reports, and data stories using Tableau and Power BI.
•Excellent communication, interpersonal, intuitive, analysis, and leadership skills, a quick starter with the ability to master and apply new concepts.
•Large Language Model fine-tuning and training; extensive hands-on experience with BERT, PaLM, and Azure OpenAI Studio, along with GPT-3, GPT-3.5, and GPT-4.
TECHNICAL SKILLS
Libraries:
NumPy, SciPy, Pandas, Theano, Caffe, scikit-learn, Matplotlib, Seaborn, Plotly, TensorFlow, Keras, NLTK, PyTorch, Gensim, urllib, BeautifulSoup4, PySpark, PyMySQL, SQLAlchemy, MongoDB, sqlite3, Flask, Deeplearning4j, EJML, dplyr, ggplot2, reshape2, tidyr, purrr, readr, Apache Spark.
Machine Learning Techniques:
Supervised Machine Learning Algorithms (Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Naïve Bayes Classifiers, K-Nearest Neighbors), Unsupervised Machine Learning Algorithms (K-Means Clustering, Gaussian Mixtures, Hidden Markov Models, Autoencoders), Imbalanced Learning (SMOTE, ADASYN, NearMiss), Deep Learning (Artificial Neural Networks), Machine Perception
Analytics:
Data Analysis, Data Mining, Data Visualization, Statistical Analysis, Multivariate Analysis, Stochastic Optimization, Linear Regression, ANOVA, Hypothesis Testing, Forecasting, ARIMA, Sentiment Analysis, Predictive Analysis, Pattern Recognition, Classification, Behavioral Modeling
Natural Language Processing:
Document tokenization, token embedding, word models, Word2Vec, FastText, Bag-of-Words, TF-IDF, BERT, ELMo, LDA
Programming Languages:
Python, R, SQL, Java, MATLAB, and Mathematica
Applications:
Machine Language Comprehension, Sentiment Analysis, Predictive Maintenance, Demand Forecasting, Fraud Detection, Client Segmentation, Marketing Analysis, Cloud Analytics in cloud-based platforms (AWS, MS Azure, Google Cloud Platform)
Deployment:
Continuous improvement in project processes, workflows, automation, and ongoing learning and achievement
Development:
Git, GitHub, GitLab, Bitbucket, SVN, Mercurial, Trello, PyCharm, IntelliJ, Visual Studio, Sublime, JIRA, TFS, Linux
Big Data and Cloud Tools:
HDFS, Spark, Google Cloud Platform, MS Azure Cloud, SQL, NoSQL, Data Warehouse, Data Lake, HiveQL, AWS (Redshift, Kinesis, EMR, EC2, Lambda)
PROFESSIONAL EXPERIENCE
Lead Data Scientist/ Gen AI Engineer with Thermo Fisher Scientific Inc., Waltham, Massachusetts
Since June 2023
As a Lead Data Scientist/Gen AI Engineer at Thermo Fisher Scientific Inc., I spearheaded projects developing AI algorithms for personalized treatment plans and defining customer segments through clustering algorithms, boosting sales by 11%. I optimized AI models using A/B testing to maximize ROI and operational efficiency, and leveraged big data analytics to improve treatment efficacy by analyzing genetic and clinical data. I also designed an automated document summarizer for multi-format inputs, delivered through a chatbot.
Responsibilities:
•Integrated the chatbot with LangChain for seamless prompt management and text processing, enabling summarization of plain text, PDF, and Word documents.
•Built a data analytics bot to communicate with the database and extract business insights using charts and plots.
•Rolled out a fully functional, scalable chat application used by thousands of users, built with LangChain tools, agents, and vector databases.
•Enabled handling of large documents by breaking them into manageable chunks, ensuring optimal performance within API token limits.
•Provided users with options to input text directly or upload documents (PDF/Word) for summarization, ensuring a user-friendly experience.
•Designed the chunking-to-embeddings flow, storing vectors in a Pinecone vector database (see the retrieval sketch after this list).
•Utilized embeddings to capture semantic meaning of text, enabling advanced document retrieval and similarity search within the summarization system.
•Enabled scalable storage and retrieval of document vectors, allowing fast and accurate similarity-based searches and enhanced document summarization using GPT-3.5 Turbo and GPT-4.
•Built a chatbot over patients' healthcare records using pre-trained LLMs and integrated RAG pipelines to generate context-aware responses.
•Developed complex analytics pipelines using Apache Spark and Apache Flink for batch processing and real-time data processing from medical devices and sensors.
•Leveraged AWS Glue for serverless ETL tasks, simplifying data transformation processes and reducing operational overhead.
•Utilized AWS SageMaker for comprehensive machine learning model development, including data preprocessing, feature engineering, model training, deployment, and monitoring, ensuring scalability and high availability.
•Optimized machine learning models on SageMaker, ensuring efficient resource use and cost-effectiveness.
•Designed and managed data workflows using Apache Airflow, creating and maintaining DAGs to automate complex data pipelines (a minimal DAG sketch follows this list).
•Used Spark's RDD and DataFrame APIs for distributed data processing, integrating machine learning libraries such as Apache Mahout and TensorFlow for model training and evaluation.
•Used Git for version control and collaboration on GitHub.
•Built solutions using LLMs (Transformers, BERT, GPT, BART) and designed a penetration attribution model using LGBM for ROI optimization.
•Developed an interactive Streamlit application to run intervention simulations, visualize causal relationship graphs, and communicate insights to stakeholders.
•Created regression models for different products (SKUs) to generate pricing and promotion elasticities.
•Implemented models to predict KPIs, increasing customer retention by 52% through precision-based retention prediction.
•Developed several ready-to-use machine learning model templates, assigning clear descriptions of purpose and input variables.
•Developed various machine learning models using Python libraries (Pandas, NumPy, Seaborn, Matplotlib, scikit-learn) and built and analyzed datasets using Python and R.
•Applied linear regression in Python and SAS to understand relationships and causal connections between dataset attributes.
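A minimal sketch of the chunking-to-embeddings retrieval flow referenced above, assuming the openai (v1+) and pinecone Python SDKs; the index name "doc-summarizer", chunk sizes, and model choices are illustrative placeholders, not the production pipeline:

    # Chunk a document, embed the chunks, upsert to Pinecone, then retrieve and summarize.
    from openai import OpenAI
    from pinecone import Pinecone

    client = OpenAI()                        # reads OPENAI_API_KEY from the environment
    pc = Pinecone(api_key="PINECONE_API_KEY")  # placeholder key
    index = pc.Index("doc-summarizer")         # hypothetical index name

    def chunk_text(text, size=1000, overlap=100):
        """Split text into overlapping chunks that fit within API token limits."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    def ingest(doc_id, text):
        chunks = chunk_text(text)
        index.upsert(vectors=[(f"{doc_id}-{i}", vec, {"text": c})
                              for i, (vec, c) in enumerate(zip(embed(chunks), chunks))])

    def summarize(query, top_k=5):
        """Retrieve the most similar chunks and summarize them with GPT."""
        hits = index.query(vector=embed([query])[0], top_k=top_k, include_metadata=True)
        context = "\n".join(m.metadata["text"] for m in hits.matches)
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Summarize the following:\n{context}"}])
        return resp.choices[0].message.content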
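And a minimal Airflow DAG sketch of the kind described above; the DAG id, schedule, and task callables are illustrative placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull source data")        # placeholder task body

    def transform():
        print("clean and aggregate")

    def load():
        print("write to warehouse")

    with DAG(dag_id="analytics_pipeline", start_date=datetime(2023, 6, 1),
             schedule_interval="@daily", catchup=False) as dag:
        extract_t = PythonOperator(task_id="extract", python_callable=extract)
        transform_t = PythonOperator(task_id="transform", python_callable=transform)
        load_t = PythonOperator(task_id="load", python_callable=load)
        extract_t >> transform_t >> load_t    # linear dependency chain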
Sr. Data Scientist/ Computer Vision Engineer with Elevance Health, Indianapolis, Indiana
Jun 2021 – June 2023
As a Sr. Data Scientist/Computer Vision Engineer with Elevance Health, I developed an OCR-based system to extract patient information from medical documents, streamlining data entry and improving patient care. I enhanced RNN-based OCR models with the CTC loss function for better text recognition in variable-length documents, and automated code extraction for claims processing using OCR and NLP models, reducing manual intervention and increasing accuracy. I designed and deployed advanced OCR models on Google Kubernetes Engine and Vertex AI, achieving a 75% reduction in manual data entry and a 20% decrease in denied claims.
•Developed AI algorithms for personalized treatment plans by analyzing genetic, clinical, and lifestyle data to improve treatment efficacy and reduce adverse reactions.
•Developed an OCR-based system to automatically extract patient information from medical documents, streamlining data entry and enhancing patient care.
•Enhanced RNN-based OCR models by implementing the connectionist temporal classification (CTC) loss function to improve text recognition in variable-length documents (see the CTC sketch after this list).
•Applied OCR and NLP models to automate the extraction of codes and information required for claims processing, minimizing manual intervention and increasing speed and accuracy.
•Designed and deployed advanced OCR models using deep learning techniques, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to extract data from diverse healthcare forms, including prescriptions, lab reports, and patient information forms.
•Deployed OCR and NLP models on Google Kubernetes Engine (GKE) and Google Vertex AI for scalable and efficient healthcare document processing.
•Utilized transfer learning techniques to fine-tune pre-trained CNN and RNN models on healthcare-specific datasets, significantly reducing training time and increasing accuracy.
•Achieved a 75% reduction in manual data entry by developing OCR models that can extract data from various healthcare forms.
•Enhanced claims processing efficiency by using OCR technology to extract text from insurance claims, reducing denied claims by 20%.
•Implemented a hybrid OCR system combining traditional rule-based OCR with deep learning-based OCR for improved accuracy and document handling flexibility.
•Created ensembles of CNN and RNN models for text recognition in images, significantly improving OCR system accuracy and robustness.
•Applied NLP techniques, such as named entity recognition (NER) and text classification, to extract structured data from unstructured text, enhancing OCR system accuracy.
•Deployed OCR and NLP models using Vertex AI for model deployment, monitoring, and management in a production environment.
•Utilized Google Cloud Functions to trigger OCR and NLP models for real-time healthcare document processing.
•Set up Azure Log Analytics for monitoring and troubleshooting OCR and NLP models in production.
•Leveraged Azure Cognitive Services, including Computer Vision and Text Analytics, to enhance OCR and NLP model capabilities.
•Maintained a track record of integrating the latest advancements in AutoML methodologies and Azure Cognitive Services to enhance model performance and stay at the forefront of industry best practices.
•Employed tools like TensorFlow, PyTorch, NLTK, and Spacy to implement NLP and deep learning techniques in e-commerce applications.
•Developed custom NLP models to extract specific information from medical documents, such as patient demographics, diagnosis, and treatment information.
•Used Azure Blob Storage for storing and managing large volumes of healthcare documents and OCR output.
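A minimal sketch of wiring CTC loss into a recurrent OCR head with PyTorch, as referenced in the list above; the tensor shapes, class count, and random stand-in for RNN output are illustrative:

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 28   # time steps, batch size, classes (27 characters + CTC blank)
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for RNN output
    targets = torch.randint(1, C, (N, 20), dtype=torch.long)             # padded label sequences
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.randint(10, 21, (N,), dtype=torch.long)       # variable-length labels

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # index 0 reserved for the blank symbol
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()   # gradients flow back into the recurrent OCR head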
Lead Data Scientist with Phillips 66, Westchase, Houston, TX
June 2018 – June 2021
As a Lead Data Scientist at Phillips 66, I was responsible for developing models to support data-driven initiatives across the company. I owned every stage, from data ingestion through model development, training, and management, championing automation throughout the entire ML pipeline in Python. In creating solutions, I carefully considered the end user's skill set, ensuring tools were designed and documented appropriately.
Responsibilities:
•Engineered predictive maintenance models and conducted survival analysis using classification and regression methods, significantly reducing machinery failure rates and maintenance costs.
•Built a time series model using ARIMA to forecast power-generation demand across sectors such as residential, commercial, and industrial (see the forecasting sketch after this list).
•Assisted in monitoring sharp market movements and synergizing with other optimization efforts by the SES team.
•Utilized XGBoost to forecast power prices for the next three months, supporting revenue optimization.
•Developed regression models for LNG forecasting on daily, monthly, and yearly bases, deploying them on the dedicated platform in Azure.
•Integrated ML models with Kubernetes-based infrastructure, leveraging EKS for efficient deployment and scaling in production environments.
•Designed and implemented a comprehensive ML workflow using MLflow, AWS Batch, Docker, and Kubernetes, streamlining and scaling model training and deployment processes.
•Created custom Docker containers for ML model packaging, ensuring consistent and reproducible environments across different ML lifecycle stages.
•Employed MLflow to track and manage experiments, enabling easy comparison of models and hyperparameters and facilitating team collaboration (see the tracking sketch after this list).
•Implemented automated model training pipelines with AWS Batch for efficient parallel processing of large datasets, reducing training time.
•Conducted comprehensive training sessions for ML team members, equipping them with necessary MLOps tools and technologies for efficient model management and deployment.
•Developed and maintained documentation and best practices for ML operations, ensuring knowledge sharing and smooth onboarding of new team members.
•Designed and implemented CI/CD pipelines using Jenkins and GitLab, enabling automated testing, building, and deployment of ML models for consistent and reliable delivery.
•Automated model monitoring using MLflow and Weights & Biases.
•Conducted performance tuning and optimization of ML models with AWS SageMaker and Lambda, significantly improving inference speed and cost efficiency.
•Developed a containerized ML model deployment solution using Docker and Kubernetes for efficient resource utilization and seamless scalability of high-volume inference requests.
•Architected a robust and scalable infrastructure using AWS Elastic Kubernetes Service (EKS) and Docker containers to efficiently manage ML model deployment, incorporating fault-tolerant mechanisms for continuous availability and optimized resource allocation.
•Collaborated with DevOps teams to integrate ML workflows into existing CI/CD pipelines for seamless deployment and version control of ML models.
•Designed and implemented scalable data pipelines using AWS Glue and Athena for seamless data ingestion, transformation, and storage, enhancing data-driven decision-making through efficient and reliable data processing and analysis.
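A minimal ARIMA forecasting sketch with statsmodels, as referenced in the list above; the synthetic demand series and the (p, d, q) order are illustrative, not the production model:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Stand-in for three years of monthly power demand history.
    idx = pd.date_range("2018-01-01", periods=36, freq="MS")
    demand = pd.Series(100 + np.arange(36) + np.random.default_rng(0).normal(0, 5, 36),
                       index=idx)

    fit = ARIMA(demand, order=(1, 1, 1)).fit()
    print(fit.forecast(steps=12))    # demand forecast for the next 12 months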
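And a sketch of the MLflow experiment tracking described above; the experiment name, toy data, and parameters are placeholders:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("lng-forecasting")          # hypothetical experiment name
    with mlflow.start_run():
        params = {"n_estimators": 200, "max_depth": 8}
        model = RandomForestRegressor(**params).fit(X_tr, y_tr)
        mlflow.log_params(params)                     # record hyperparameters for comparison
        mlflow.log_metric("mse", mean_squared_error(y_te, model.predict(X_te)))
        mlflow.sklearn.log_model(model, "model")      # versioned artifact for the run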
Senior Data Scientist with Truist Financial Corporation, Charlotte, North Carolina
Feb 2016 – May 2018
As a Sr. Data Scientist with Truist Financial Corporation, I collaborated with the Cyber Security and Fraud departments to enhance credit/debit card fraud detection using machine learning algorithms. My main role was to design a data-driven model that accurately detected fraud cases while minimizing false positives. Employing statistical methods, I addressed the challenge of highly imbalanced datasets by implementing innovative solutions to ensure high confidence in predictions and outlier detection.
Responsibilities:
•Conducted data cleaning, feature scaling, and feature engineering using pandas and NumPy in Python, and developed models with deep learning frameworks.
•Utilized scikit-learn's model selection framework for hyper-parameter tuning with the GridSearchCV and RandomizedSearchCV algorithms.
•Addressed data imbalance by stratifying datasets to ensure fair representation of minority classes during cross-validation (a combined search-and-stratification sketch follows this list).
•Developed a fraud detection model using logistic regression, SVM, Random Forest, XGBoost, and AdaBoost, achieving 92% accuracy on test data.
•Accessed production SQL databases to extract and validate data against third-party systems.
•Worked with large datasets (10M+ observations of text data), performing data cleaning and exploratory data analysis (EDA) using techniques like bag of words, K-means, and DBSCAN.
•Leveraged Scikit-Learn, SciPy, Matplotlib, and Plotly for EDA and data visualization.
•Consulted with regulatory and subject matter experts to understand data streams and variables.
•Developed artificial neural network models to detect anomalies and fraud in transaction data.
•Conducted data mining and built statistical models in Python to provide actionable insights to business executives.
•Deployed models using a Flask API within a Docker container (a minimal endpoint sketch follows this list).
•Assessed model performance with metrics such as confusion matrix, accuracy, recall, precision, and F1 score, with particular focus on recall.
•Used Git for version control on GitHub to collaborate with team members.
•Created dashboards with Tableau and provided detailed reports, including summaries, charts, and graphs for team and stakeholders.
•Collaborated with Data Engineers on database design for Data Science projects.
•Developed unsupervised K-Means and Gaussian Mixture Models (GMM) from scratch in NumPy to detect anomalies.
•Employed a heterogeneous stacked ensemble of methods for final decisions on fraudulent transactions.
•Implemented a Python-based distributed random forest.
•Used predictive modeling tools in SAS, SPSS, R, and Python.
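A minimal sketch of hyper-parameter search over stratified folds, as referenced in the list above; the imbalanced toy data, parameter grid, and recall-first scoring are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    # Imbalanced toy data standing in for the fraud dataset (5% positive class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [4, 8, None]},
        scoring="recall",                              # recall-focused, as noted above
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)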
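And a minimal Flask scoring endpoint of the kind containerized with Docker above; the model path and feature schema are hypothetical:

    import joblib
    import numpy as np
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.joblib")    # hypothetical serialized model

    @app.route("/predict", methods=["POST"])
    def predict():
        features = np.array(request.json["features"]).reshape(1, -1)
        proba = float(model.predict_proba(features)[0, 1])
        return jsonify({"fraud_probability": proba})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)   # bound to 0.0.0.0 for the container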
Senior Data Scientist with Southwest Airlines, Dallas, Texas
Mar 2014 – Jan 2016
As a Senior Data Scientist at Southwest Airlines, I resolved optimization problems in OTP, crew scheduling, and gate optimization. I performed parameter tuning using grid search, feature selection, and model evaluation in various scenarios. I conducted rapid data science experiments with datasets ranging from GBs to TBs, improving merchant prediction by 2% and enhancing transaction-level geo detail population by 11%. Additionally, I developed modules processing millions of transactions daily, significantly reducing costs and ensuring optimal performance through proactive monitoring and anomaly detection algorithms.
Responsibilities:
•Resolved optimization challenges in OTP, crew scheduling, and gate management.
•Conducted parameter tuning using grid search, feature selection, and model evaluation across various scenarios.
•Conducted rapid Data Science experiments resulting in products using datasets ranging from GBs to TBs. Enhanced merchant prediction by 2% through innovative adaptations of open-source software.
•Improved granularity of transaction-level geo details by 11% with a data-driven module.
•Developed a merchant prediction module processing approximately 1 million transactions daily and a geo module processing around 8.8 million transactions daily.
•Achieved a 4x reduction in costs for the 'Store ID Identification' module, resulting in savings of $60 per day or $30,000 annually.
•Applied clustering-based outlier detection algorithms like CBLOF and Angle-Based Outlier Detectors to identify anomalies in medicine usage patterns.
•Built and maintained data pipelines to ensure timely and accurate data ingestion for anomaly detection.
•Implemented proactive monitoring and maintenance protocols to optimize the performance and effectiveness of deployed machine learning models.
•Designed and deployed machine learning models utilizing advanced anomaly detection algorithms such as Isolation Forest, Local Outlier Factor, and One-Class Support Vector Machine to detect fraudulent claims and anomalies in medicine usage (see the detection sketch after this list).
•Conducted thorough model evaluation and validation using precision, recall, and F1-score metrics to ensure the robustness and efficacy of anomaly detection algorithms.
•Collaborated extensively with subject matter experts to integrate domain knowledge into anomaly detection model development and implementation.
•Engaged in comprehensive feature engineering to extract meaningful features from medical claims data, enhancing model performance.
•Utilized the Histogram-based Outlier Score algorithm to develop a reliable system for accurately detecting and flagging fraudulent claims and early signs of medication abuse.
•Employed convolutional autoencoders, a deep learning technique, for precise and efficient anomaly detection in detecting fraudulent claims and identifying medication abuse early on.
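A minimal sketch of the anomaly-detection approach referenced above, combining scikit-learn's IsolationForest and LocalOutlierFactor on synthetic data; the contamination rates and the agreement rule are illustrative:

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (950, 4)),      # normal usage patterns
                   rng.normal(6, 1, (50, 4))])      # injected anomalies

    iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
    lof = LocalOutlierFactor(contamination=0.05)

    iso_flags = iso.predict(X) == -1                # -1 marks an outlier
    lof_flags = lof.fit_predict(X) == -1
    flagged = iso_flags & lof_flags                 # both detectors must agree
    print(f"flagged {flagged.sum()} of {len(X)} records")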
Data Scientist with Berkshire Hathaway, Omaha, Nebraska
Jan 2012 – Mar 2014
As a Data Scientist at Berkshire Hathaway, I specialized in model development, constructing churn analysis models, performing market segmentation, and estimating customer lifetime value. Additionally, I extracted insights from existing datasets and prepared data for further analysis as part of a collaborative team effort.
Responsibilities:
•Conducted sentiment analysis on customer feedback to identify satisfaction drivers and implemented targeted improvements.
•Developed advanced customer segmentation algorithms to optimize marketing budget allocation, achieving a 20% reduction in expenditures.
•Designed and implemented robust data pipelines using SQL, Python, and Hadoop for reliable data analysis.
•Executed A/B testing to optimize marketing campaigns, resulting in a 15% improvement in ROI (see the significance-check sketch after this list).
•Collaborated on recommendation systems to enhance customer experiences and increase upsell opportunities.
•Led cross-functional teams in customer segmentation analysis, boosting engagement by 25%.
•Stayed updated on data science and machine learning advancements for enhanced analytical capabilities.
•Implemented clear data visualization techniques to present insights effectively.
•Applied natural language processing for customer feedback analysis and product improvement.
•Conducted market segmentation to tailor strategies based on demographic, behavioral, and psychographic data.
•Presented actionable insights to senior executives based on comprehensive data analysis.
•Utilized A/B testing to assess marketing campaign impact and provide data-driven recommendations.
•Collaborated across teams to define project objectives, gather data, and develop analytical solutions.
•Built models to estimate customer lifetime value for strategic decision-making.
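A minimal sketch of the A/B significance check behind the campaign tests above, using statsmodels' two-proportion z-test; the conversion counts are illustrative:

    from statsmodels.stats.proportion import proportions_ztest

    conversions = [420, 510]        # variant A vs. variant B conversions
    exposures = [10000, 10000]      # users exposed to each variant
    stat, p_value = proportions_ztest(conversions, exposures)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")   # p < 0.05 -> adopt the winning variant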
Data Analyst with Absolute Data, New York City, NY
Nov 2009 – Dec 2011
•Gathered and preprocessed large datasets, ensuring data quality and integrity using tools like Python (pandas, NumPy) and SQL.
•Performed exploratory data analysis (EDA) to uncover insights, trends, and patterns using statistical techniques and visualization tools such as Tableau and Power BI.
•Developed and presented comprehensive reports and dashboards to stakeholders, translating complex data findings into actionable business recommendations.
•Worked closely with cross-functional teams, including marketing, finance, and operations, to support data-driven decision-making.
•Utilized machine learning models for predictive analysis and forecasting, enhancing business strategies and performance.
•Applied expertise in tools such as Python, R, SQL, Tableau, and Power BI for effective data analysis and visualization.
•Stayed current with the latest data analytics trends and technologies to continuously improve data processes and methodologies.
Data Analyst with Indium Software, Princeton, NJ
Jan 2007 – Oct 2009
•Created and presented comprehensive reports and interactive dashboards using tools like Tableau and Power BI.
•Collected, cleaned, and maintained large datasets, ensuring accuracy and integrity using SQL and Python (pandas, NumPy).
•Conducted detailed data analysis and statistical modeling to identify trends, patterns, and insights.
•Partnered with cross-functional teams to support business objectives with data-driven insights.
•Utilized advanced analytical tools and stayed current with industry trends to improve data analysis processes.
EDUCATION
Master’s degree in Operational Research Engineering
University of Bejaia, Bejaia, Algeria
LICENSES & CERTIFICATIONS
•Microsoft Certified: Data Analyst Associate
•Microsoft Certified: Power Platform Associate
•Microsoft Certified: AI Engineer
•Microsoft Certified: Azure Data Engineer