Dileep Vinay Bezawada
Data Scientist
CONTACT:
******************@****.***

EDUCATION:
Stevens Institute of Technology, Master of Applied Artificial Intelligence

CERTIFICATIONS:

PROFILE SUMMARY:
Experienced Data Scientist with 5+ years of expertise in applying advanced analytical techniques, machine learning models, and statistical methods to solve complex business problems. Adept at leveraging large data sets, data mining, and predictive modelling to deliver actionable insights.
• 5+ years of experience in data science, including machine learning, statistical modelling, and advanced analytics, with expertise in Python, R, SQL, and Scala for data manipulation, analysis, and model development.
• Proficient in machine learning libraries such as TensorFlow, Keras, Scikit-learn, and PyTorch for building predictive models spanning classification, regression, and clustering, with strong knowledge of statistical techniques such as hypothesis testing, A/B testing, and time series analysis.
• Skilled in data preprocessing, feature engineering, and ETL processes using tools like Pandas, NumPy, and Apache Spark.
• Experience working with big data technologies including Hadoop, Hive, and AWS EMR to process large-scale datasets.
• Adept at data visualization using Tableau, Power BI, Matplotlib, and Seaborn for creating dashboards and reports.
• Hands-on experience with Natural Language Processing (NLP) tools such as NLTK and SpaCy for text mining, sentiment analysis, and topic modelling, along with knowledge of deep learning algorithms, including neural networks and CNNs, using TensorFlow and Keras.
• Experienced with cloud platforms like AWS, GCP, and Azure for deploying machine learning models and managing data pipelines.
• Strong proficiency in SQL for writing complex queries and performing data extraction from relational databases like MySQL, PostgreSQL, and Oracle.
• Familiarity with big data ecosystems and data warehousing solutions including Redshift, Snowflake, and BigQuery.
• Expertise in model deployment and monitoring using MLOps practices, including Docker, Kubernetes, and CI/CD pipelines.
• Excellent communication skills with the ability to present technical findings to both technical and non-technical stakeholders.
• Proficient in machine learning technologies such as Scikit-learn, TensorFlow, Keras, PyTorch, and XGBoost for building, training, and deploying predictive models, including classification, regression, clustering, and deep learning algorithms, with knowledge of time series analysis using AR, MA, ARIMA, GARCH, and ARCH models.
• Capable of developing advanced data visualizations and dashboards using Tableau, Power BI, and Python libraries like Matplotlib and Plotly to present findings clearly.
• Skilled in using NoSQL databases such as MongoDB, Cassandra, and Elasticsearch for handling unstructured data and high-volume real-time data processing.
• Familiar with containerization technologies like Docker and Kubernetes for deploying and scaling machine learning models in production environments.

TECHNICAL SKILLS:
• Programming Languages: Python, R, SQL, Scala, Java
• Machine Learning Libraries: Scikit-learn, TensorFlow, Keras, PyTorch, XGBoost, LightGBM
• Big Data Technologies: Hadoop, Apache Spark, HDFS, Hive, Pig, MapReduce
• Data Visualization Tools: Tableau, Power BI, Matplotlib, Seaborn, Plotly
• Statistical Analysis: Hypothesis testing, A/B testing, regression analysis, time series forecasting
• Data Processing & ETL: Pandas, NumPy, Dask, Apache NiFi, Alteryx
• Cloud Platforms: AWS (S3, EC2, EMR, SageMaker), GCP (BigQuery, Dataflow), Azure (Data Lake, Machine Learning)
• Databases: MySQL, PostgreSQL, MongoDB, Oracle, Redis
• Big Data Warehousing: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse
• Natural Language Processing (NLP): NLTK, SpaCy, BERT, Word2Vec, Gensim
• Version Control: Git, GitHub, Bitbucket
• Model Deployment & MLOps: Docker, Kubernetes, Jenkins, Flask, FastAPI, MLflow
• Containerization: Docker, Kubernetes, and AWS Fargate
• Data Pipeline/Orchestration: Apache Airflow, Luigi, Kafka, and AWS Glue
• NLP Tools: NLTK, spaCy, Transformers, and BERT
• Automation & Scripting: Bash, PowerShell, and Python scripting.

WORK EXPERIENCE:
Selective Insurance Data Scientist
Branchville, New Jersey Dec 2024 – Present
Description: Selective Insurance Group, Inc. is a leading property and casualty insurance holding company. Responsible for extracting data from multiple sources; organizing, processing, cleaning, and validating the data with machine learning tools; analysing the data for information and patterns; developing prediction systems; using statistical techniques and visualization tools to understand the data; and applying data science techniques to identify fraudulent activities.
Responsibilities:
• Built machine learning models using Random Forest, Gradient Boosting, and Logistic Regression to predict claim frequency, policy lapse risk, and fraud likelihood, improving underwriting precision and reducing operational losses (see the illustrative sketch at the end of this role).
• Engineered policyholder risk features using Python (Pandas, NumPy) and performed advanced data wrangling to support actuarial pricing models and retention strategy optimization.
• Utilized AWS Redshift, Snowflake, and SQL to extract and analyse multi-line policy, claims, and customer data, driving insights into loss trends and improving premium pricing models.
• Automated claims data workflows through Apache Spark and Apache Airflow, enabling timely actuarial reporting and faster fraud investigation cycles.
• Designed insurance operations dashboards in Tableau and Power BI to visualize KPIs such as loss ratio, expense ratio, and combined ratio for underwriting performance evaluation.
• Applied NLP methods using spaCy and NLTK to analyse adjuster notes and customer service transcripts, identifying root causes of delays and dissatisfaction in claims processing.
• Deployed fraud detection models via AWS SageMaker, integrating them into Selective’s claims platform for real-time scoring of suspicious activities during first notice of loss (FNOL).
• Developed automated reporting pipelines and interactive underwriter reports using Jupyter Notebooks and Python, increasing actuarial team productivity and accuracy.
• Built deep learning models using TensorFlow and PyTorch to predict large-loss events, litigation probability, and subrogation potential from unstructured data sources.
• Created seamless ETL workflows with Informatica and Talend to ingest data from third-party reinsurers, catastrophe models, and ISO feeds for regulatory compliance and exposure analysis.
• Leveraged Hadoop (HDFS, Hive) to process long-tail claims data, enabling granular trend analysis across commercial auto, property, and workers’ compensation insurance lines.
• Developed and deployed Flask-based internal APIs and RESTful services to serve underwriting and claims risk scores to internal web applications in real time.
• Utilized GCP BigQuery and Dataflow to build scalable actuarial data marts and enable high-speed queries for loss development factor (LDF) calculations and reserving analytics.
• Employed Docker and Kubernetes to containerize actuarial toolkits and pricing engines, ensuring reliable and scalable access across underwriting regions.
• Conducted advanced product performance assessments using Alteryx, integrating quote-to-bind data and claims severity metrics to support data-driven product enhancements.
• Used Neo4j graph analytics to model agent-policyholder-claim networks, uncovering collusive fraud rings and reducing suspicious claim payouts.
• Built custom catastrophe risk scoring engines using SAS and MATLAB, incorporating geographic, structural, and historical claim data to improve catastrophe underwriting discipline.
Environment: Python, Pandas, NumPy, SQL databases, Apache Spark, Airflow, Tableau, Power BI, NLTK, spaCy, AWS SageMaker, machine learning algorithms, TensorFlow, PyTorch, Informatica, Talend, Hadoop, Hive, HDFS, RESTful services, Flask, BigQuery, Dataflow, Docker, Kubernetes, Alteryx, Neo4j, Tesseract, AI workflows, SAS, MATLAB.
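Illustrative sketch (not Selective's actual code or data): a minimal example of the kind of claim-level fraud classification described above, using scikit-learn's Gradient Boosting; the CSV path, feature names, label column, and 0.5 cutoff are hypothetical placeholders.

    # Minimal sketch: gradient-boosted fraud classifier on tabular claim features.
    # The CSV path, feature names, and label column are hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, classification_report

    claims = pd.read_csv("claims_features.csv")            # hypothetical engineered-feature extract
    features = ["claim_amount", "days_to_report", "prior_claims", "policy_tenure_months"]
    X, y = claims[features], claims["is_fraud"]            # binary fraud label assumed to exist

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=3)
    model.fit(X_train, y_train)

    scores = model.predict_proba(X_test)[:, 1]             # fraud likelihood per claim
    print("ROC AUC:", roc_auc_score(y_test, scores))
    print(classification_report(y_test, scores > 0.5))     # illustrative 0.5 decision cutoff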
Munich Reinsurance America Data Scientist
Princeton, New Jersey, USA Apr 2024 – Nov 2024
Description: Munich Reinsurance America, Inc. is a premier property & casualty reinsurance provider that offers advanced solutions in catastrophe risk, specialty reinsurance, and customized risk transfer. Developed and trained machine learning models to predict future risks, claims, and other important metrics, and developed and refined risk models to assess the potential impact of various events on the reinsurance portfolio.
Responsibilities:
• Developed predictive models using Random Forest, Logistic Regression, and XGBoost to estimate claim severity, litigation risk, and reinsurance exposure, enhancing underwriting accuracy and portfolio management.
• Conducted advanced data analysis with Python, R, and Tableau to identify loss trends across commercial auto and catastrophe lines, enabling risk-adjusted pricing strategies.
• Implemented real-time catastrophe tracking and claim intake systems using Kafka and Spark Streaming, enabling faster response to events like hurricanes, wildfires, and severe storms.
• Utilized SQL and NoSQL databases to process historical claims, reinsurer treaties, and market data for regulatory reporting and reserving analytics.
• Automated treaty data workflows using Apache Airflow and Talend, improving the efficiency of bordereaux ingestion, risk aggregation, and treaty compliance checks.
• Applied NLP with spaCy and NLTK to extract policy details, exclusions, and sublimits from insurance contracts and legal documents for treaty validation.
• Performed time-series modelling with ARIMA and LSTM to forecast reserve development, incurred but not reported (IBNR) claims, and seasonal claim spikes (see the illustrative sketch at the end of this role).
• Ensured compliance with reinsurance regulatory standards and internal governance policies (e.g., SOX, NAIC) through automated reporting workflows and structured audit trails.
• Built dynamic dashboards using Power BI, Tableau, and Excel macros to track combined ratio, loss ratio, and net exposure across ceded and assumed reinsurance programs.
• Used SAS, Python, and R for actuarial analysis, large loss triangle modelling, and reinsurer performance benchmarking.
• Developed CI/CD pipelines using Docker, Kubernetes, and Jenkins to deploy actuarial pricing engines and exposure models across staging and production environments.
• Tuned and monitored model performance using GridSearchCV and Bayesian Optimization, optimizing predictive accuracy for property risk models.
• Architected reinsurance data marts and analytical platforms on Snowflake and Amazon Redshift to support loss development analytics, risk aggregation, and regulatory submissions.
• Utilized Elasticsearch and Splunk to monitor log activity of pricing engines and policy systems, enabling real-time anomaly detection and system diagnostics.
• Built and maintained internal APIs using Flask and FastAPI to deliver real-time risk scores and reinsurer treaty details to internal underwriting platforms.
• Explored blockchain for policy issuance validation and treaty agreement transparency across multiple reinsurers and retrocessionaires.
• Applied TensorFlow and PyTorch to develop models that classify catastrophe images for claims triage and automate damage estimation in large-scale disasters.
Environment: Python, Tableau, logistic regression, random forest, XGBoost, SQL, ETL tools, Apache Airflow, Talend, Kafka, Spark Streaming, NLP, Excel macros, Power BI, S3, Redshift, BigQuery, Docker, Kubernetes, Jenkins, GridSearchCV, Bayesian Optimization, Snowflake, Flask, FastAPI, Elasticsearch, Splunk.
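Illustrative sketch (not Munich Re data or a production model): a minimal ARIMA forecast of a monthly incurred-claims series using statsmodels; the series values and the (1, 1, 1) order are invented for demonstration.

    # Minimal sketch: ARIMA forecast of a hypothetical monthly incurred-claims series.
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical monthly incurred claims (USD millions).
    incurred = pd.Series(
        [12.1, 13.4, 11.8, 14.2, 15.0, 13.9, 16.3, 15.7, 14.8, 17.1, 16.5, 18.0],
        index=pd.date_range("2023-01-31", periods=12, freq="M"),
    )

    # The (1, 1, 1) order is a placeholder; in practice it would be selected via AIC/BIC.
    fitted = ARIMA(incurred, order=(1, 1, 1)).fit()

    forecast = fitted.forecast(steps=6)    # six-month-ahead point forecasts
    print(forecast)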
Glenmark Pharmaceuticals Data Analyst / Data Scientist
Bangalore, India Aug 2021 – July 2023
Description: Glenmark Pharmaceuticals Ltd. is a research-driven global pharmaceutical company that focuses on developing and marketing branded generics, specialty drugs, and over-the-counter (OTC) products. Analysed large datasets from various sources, including clinical trials, molecular data, and market research, to identify patterns and trends.
Responsibilities:
• Extracted, processed, and analysed large-scale clinical and pharmaceutical datasets using SQL, Python, and R to identify treatment efficacy trends and optimize drug development strategies.
• Developed and deployed machine learning models using TensorFlow, PyTorch, and scikit-learn for drug response prediction, adverse event detection, and clinical trial outcome forecasting.
• Utilized Hadoop, Apache Spark, and Hive to process genomics and patient health data, enabling scalable analytics for drug efficacy studies and personalized medicine initiatives.
• Built interactive visualizations and dashboards using Tableau, Power BI, and Matplotlib to monitor key performance indicators in drug trials and pharmaceutical manufacturing processes.
• Optimized clinical data storage and access workflows using Snowflake, Amazon Redshift, and Google BigQuery, ensuring rapid querying and reporting for global regulatory submissions.
• Applied Natural Language Processing (NLP) techniques with spaCy, NLTK, and Transformers to extract insights from medical literature, trial reports, and regulatory documents (see the illustrative sketch at the end of this role).
• Managed unstructured R&D data using MongoDB, Cassandra, and Elasticsearch, facilitating secure access to research notes, compound libraries, and experimental outcomes.
• Used Git, GitHub, and Bitbucket for version control and implemented CI/CD pipelines with Jenkins, GitLab CI/CD, and CircleCI to automate deployment of data science applications.
• Conducted text mining and sentiment analysis using TextBlob, Gensim, and BERT to assess scientific publications, patient feedback, and social discourse related to Glenmark's drug portfolio.
Environment: SQL, Python, R, TensorFlow, PyTorch, scikit-learn, Hadoop, Spark, Hive, Tableau, Power BI, Matplotlib, Snowflake, Redshift, BigQuery, spaCy, NLTK, PostGIS, ArcGIS, MongoDB, Cassandra, Elasticsearch, NoSQL databases, Jenkins, GitLab CI/CD, CircleCI, TextBlob, Gensim.
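Illustrative sketch (invented example text, not Glenmark documents): a minimal spaCy entity-extraction pass of the kind described above; a production pipeline would more likely use a biomedical model (e.g., scispaCy) than the general-purpose en_core_web_sm shown here.

    # Minimal sketch: extracting named entities from trial-report text with spaCy.
    # The sample sentence is invented; a biomedical pipeline would be used in practice.
    import spacy

    nlp = spacy.load("en_core_web_sm")   # general-purpose English pipeline

    text = (
        "The Phase III trial enrolled 420 patients across 12 sites in India "
        "and reported a 15% reduction in adverse events versus placebo."
    )

    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)      # e.g. CARDINAL, GPE, PERCENT entities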
Baxter International Data Analyst
Bangalore, India June 2019 – July 2021
Description: Baxter International Inc. is a global healthcare company that develops, manufactures, and markets a broad range of essential healthcare products. Supported the operational efficiency of the department by coordinating fiscal management activities, performing data analysis, and maintaining departmental statistics and reports.
Responsibilities:
• Analysed healthcare data, including patient admission rates, treatment effectiveness, and drug usage, using statistical analysis with R and Python (Scikit-learn, statsmodels) to improve clinical decision-making.
• Leveraged big data technologies like Apache Spark and Hadoop to analyse high-volume healthcare datasets, identifying patterns in patient care and operational inefficiencies.
• Designed and implemented data models for healthcare analytics using Snowflake, Azure Synapse Analytics, and Google BigQuery to support business intelligence and clinical decision support.
• Leveraged Matplotlib and Seaborn in Python for creating detailed visualizations of healthcare trends, including patient demographics, treatment outcomes, and operational performance metrics.
• Ensured CI/CD pipeline integration for healthcare data solutions using Jenkins, GitLab CI/CD, and Azure DevOps, enabling seamless deployment of analytics applications in clinical environments.
• Implemented deep learning models using TensorFlow and Keras to predict disease progression, identify potential health risks, and optimize treatment plans based on patient data.
• Developed and deployed real-time monitoring systems for patient vitals and medical equipment using Apache Kafka and Apache Flink, ensuring timely alerts for critical conditions.
• Applied machine learning techniques such as XGBoost and LightGBM for predicting patient outcomes, optimizing hospital resource allocation, and detecting anomalies in medical data (see the illustrative sketch at the end of this role).
• Engineered ETL pipelines using Apache NiFi and Talend to automate ingestion and transformation of heterogeneous healthcare data from EHR and medical devices.
Environment: Python, Scikit-learn, statsmodels, Apache Spark, Hadoop, Snowflake, Azure Synapse Analytics, Google BigQuery, Matplotlib, Seaborn, Jenkins, GitLab CI/CD, Apache Kafka, Apache Flink, XGBoost, LightGBM, Apache NiFi, Talend.
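Illustrative sketch (hypothetical feature names, label, and file path, not Baxter data): a minimal XGBoost classifier for a patient-outcome prediction task like the one described above, using the library's scikit-learn API.

    # Minimal sketch: XGBoost classifier for a hypothetical 30-day readmission label.
    # The CSV path, feature names, and label column are placeholders.
    import pandas as pd
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    records = pd.read_csv("patient_records.csv")           # hypothetical de-identified extract
    features = ["age", "length_of_stay", "num_prior_admissions", "num_medications"]
    X, y = records[features], records["readmitted_30d"]    # binary outcome assumed to exist

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=7
    )

    clf = XGBClassifier(n_estimators=400, learning_rate=0.05, max_depth=4, eval_metric="logloss")
    clf.fit(X_train, y_train)

    probs = clf.predict_proba(X_test)[:, 1]                 # predicted readmission risk
    print("ROC AUC:", roc_auc_score(y_test, probs))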