Machine Learning Data Scientist

Location:

United States

Posted:

December 30, 2024

Resume:

Pranay Kumar

Sr. Data Scientist

***********************@*****.***

469-***-**** www.linkedin.com/in/pranay-a-726a22276

Professional Summary:

oOver a decade of experience in Data Science, specializing in machine learning (supervised, unsupervised), deep learning, and AI-driven solutions.

oProven ability to design and deploy machine learning models using tools like Scikit-learn, TensorFlow, and PyTorch, with extensive expertise in Python libraries such as NumPy, Pandas, Seaborn, and Matplotlib.

oExpertise in problem-solving, use case formulation, data gathering, exploratory data analysis, preprocessing, and model development, driving innovative solutions across projects.

oStrong experience in cloud-based architectures and AWS services (Lambda, S3, DynamoDB), with hands-on experience building web applications in cloud environments.

oLeadership in Gen AI solutions, orchestrating the design and implementation of AI-powered models to optimize business performance and enhance customer engagement.

oSkilled in data visualization using PowerBI, Tableau, and Matplotlib, providing actionable insights to stakeholders through dashboard creation and data storytelling.

oExperience in Natural Language Processing (NLP) and Gen AI models, leveraging FAISS, Qdrant, and LLMs to improve customer service with sentiment prediction.

oProficient in extracting and transforming unstructured data into structured formats using libraries like PyPDF, PDFMiner, and LayoutParser.

oDemonstrated success in cross-functional collaboration, leading teams of data scientists, product managers, and business analysts to deliver data-driven solutions aligned with business objectives.

oSkilled in data infrastructure management, ensuring accuracy, consistency, and scalability of data pipelines, along with continuous process improvement initiatives.

oAdept at utilizing version control (Git, GitHub) and issue tracking tools (Jira) for seamless project management and collaboration.

Achievements and Key Contributions:

oLed an AI-driven document processing project, improving accuracy by 30% across multiple use cases.

oSuccessfully delivered a healthcare chatbot project, improving policy query response time by 20%.

oImplemented advanced machine learning models for price optimization, increasing promotional effectiveness.

Technical Skills:

oProgramming Languages: Python, SQL, R

oData Science & Machine Learning Frameworks: NumPy, Pandas (DataFrames), Scikit-Learn, TensorFlow, PyTorch, NLTK, SciPy, MLflow, ONNX

oCloud Platforms & Services: AWS (S3, Lambda, DynamoDB, EventBridge, Step Functions, OpenSearch, SQS, SNS, API Gateway, SageMaker), Azure Data Lake, Azure Databricks, GCP.

oMachine Learning & Deep Learning Methodologies: K-means clustering, Random Forest (Classifier & Regressor), SVM, Decision Trees, CNN, LSTM, Hyperparameter Tuning (GridSearchCV, RandomizedSearchCV), XGBoost, Segmentation, Logistic Regression, Bayesian statistics, Forecasting, BERT, RAG, NLP (Natural Language Processing), LangChain, Hugging Face, ChromaDB, Machine Learning Operations & Optimization (Gurobi, PuLP), Kubeflow.

oGeneral AI & Gen AI Technologies: LLM (Large Language Models), Gen AI RAG Models, Conversational AI Agents, AWS AI services, LLM Chatbots, Gen AI APIs, Flask/REST APIs

oCI/CD Pipelines & API Development: Building CI/CD pipelines and developing APIs, integrated with version control systems

oDatabases & Data Warehousing: MySQL, Microsoft SQL Server, PL/SQL, Azure Data Lake, Azure Databricks

oData Visualization & Business Intelligence: Matplotlib, Seaborn, PowerBI, Tableau

oVersion Control Systems: Git, GitHub, SVN, Bitbucket

oIDEs & Development Tools: PyCharm, Jupyter, Anaconda Prompt, PySpark, VSCode

oTesting Tools & Methodologies: Selenium, Functional Testing, Database Testing, Web Service Testing, SQL Server Management Studio, TestNG, Postman, REST Client, Beyond Compare

oMonitoring & Collaboration Tools: Jira, Datadog, Grafana

oOperating Systems: Windows, Linux

Educational Qualification:

oBachelor's in Electrical and Electronics Engineering, JNTUH, India - 2012

Professional Experience:

Client: Findability Sciences (Boston, MA), July 2023 – Present

Title: NLP Engineer

Project-9: Intelligent Document Processing and Section-wise Text Analysis

Business Objective: Efficiently extract and analyze text from uploaded documents, organize it into templates, and provide concise summaries for accurate, actionable information.

Responsibilities:

oDeveloped algorithms to extract text from manually uploaded files using Python libraries like PyPDF, PyMuPDF, PdfMiner, and Pdf2image, ensuring robust and accurate text extraction from complex documents.

oIntegrated Azure AI Document Intelligence for enhanced text extraction, utilizing OCR and AI-driven models to handle intricate document structures, including scanned images and handwritten text.

oUtilized Azure Databricks for scalable data processing and analysis, optimizing workflows and enabling efficient handling of large document datasets.

oUsed Delta Lake on Azure Databricks to store and manage a large database of published documents, with real-time updates, version control, and efficient search.

oUsed MLflow on Databricks to manage the lifecycle of Generative AI models, including versioning, experiment tracking, and automated deployment to production environments.

oDesigned and optimized data pipelines in Azure Databricks for seamless integration of extracted data into downstream applications, ensuring data is processed in real-time and made available for immediate use.

oBuilt a scalable Databricks workflow that seamlessly integrates with Azure Blob Storage and Data Lake, enabling real-time ingestion, processing, and analysis of unstructured document data.

oLeveraged LLMs (GPT-4) and Retrieval-Augmented Generation (RAG) models for intelligent text retrieval and summarization, significantly improving the relevance and accuracy of document summaries.

oOptimized Databricks clusters for parallel processing of document extraction workflows, utilizing autoscaling and spot instances to reduce computation costs while maintaining performance.

oImplemented advanced techniques to extract images and tables from unstructured PDFs, using Pdf2image and custom image processing algorithms to accurately process visual data.

oDeveloped structured templates for organizing extracted text, ensuring data consistency and meeting stakeholder requirements through a uniform format.

oBuilt workflows for processing and analyzing extracted tables, converting them into structured formats like CSV for further analysis and reporting.

oCollaborated with cross-functional teams (data scientists, engineers) to optimize extraction and analysis processes, resulting in enhanced project efficiency and output quality.

oIntegrated Generative AI models to auto-generate summaries and insights from extracted data, providing key highlights and actionable information for stakeholders.

oEnsured accuracy and consistency of extracted data by conducting thorough validation and testing, utilizing Azure AI testing tools and performance metrics for reliability.

oDeveloped a workflow to store extracted data (text, images, tables) in a structured database, ensuring easy access for future analysis and reporting.

oImplemented error-handling mechanisms and logging systems to track performance and accuracy, enabling proactive resolution of extraction and summarization issues.

oConducted continuous testing and model refinement to optimize efficiency in production environments, ensuring reliable performance.

Project-8: Integrated Document Intelligence: Chatbot-Driven PDF Interpretation and Audit Entity Extraction

Business Objective: Revolutionize the interpretation and processing of large unstructured PDF documents using chatbot technology and IAP methodologies to enhance document comprehension, extract key entities, and ensure consistent, relevant, and accessible data.

Responsibilities:

oDeveloped algorithms for parsing unstructured PDF data, enabling accurate extraction of key entities from large document sets.

oDesigned the IAP (Intelligent Automation Process) system for comprehensive document auditing and entity extraction, ensuring data accuracy and consistency.

oCollaborated with the Chatbot Developer to integrate parsing models and methodologies, enhancing document comprehension and enabling automated auditing.

oDesigned and developed the chatbot interface, focusing on creating a seamless user experience that aligns with the document interpretation objectives.

oTested and refined the chatbot functionalities in real-world scenarios, ensuring the interface and models met performance expectations.

oTested and validated the IAP system’s entity extraction capabilities, continuously improving accuracy through model refinement.

oPerformed continuous testing and model optimization to ensure system accuracy, efficiency, and scalability in production environments.

oDocumented bugs and collaborated with developers to ensure timely resolution of issues, improving overall system reliability.

oPrepared detailed system documentation covering functionalities, integration processes, and best practices for maintaining the system.

Client: Elevance Health Care (Carelon), India, Feb 2022 – July 2023

Title: Sr. Data Scientist

Project-7: Healthcare Policy Chatbot Development

Business Objective: Develop a healthcare chatbot with question-and-answer capabilities to help employees understand guidelines and policies by converting unstructured PDF policy documents into structured data for seamless interaction.

Responsibilities:

oDeveloped and implemented a healthcare policy chatbot with advanced question-and-answer functionality, enabling employees to easily access and understand complex policy guidelines.

oUtilized NLP techniques to extract and structure data from unstructured PDF healthcare policy documents, converting them into a chatbot-friendly format.

oIntegrated Generative AI models (LLM - GPT-3.5 Turbo) and Retrieval-Augmented Generation (RAG) models to generate accurate, contextually relevant responses, enhancing the chatbot’s information retrieval capabilities.

oDesigned and implemented a FAISS vector database for efficient storage and retrieval of embeddings, improving chatbot performance when handling large volumes of policy data.

oIntegrated the chatbot with internal systems, providing employees with seamless, secure access to policy information through a unified chatbot interface.

oCollaborated closely with the healthcare policy team to ensure the chatbot accurately reflected the latest guidelines and policies, incorporating feedback to improve chatbot responses.

oConducted rigorous testing and validation of chatbot responses, ensuring compliance and accuracy in all interactions with employees.

oLed training sessions for employees, demonstrating how to effectively use the chatbot to navigate and understand complex healthcare policies.

oDocumented the entire development process, creating comprehensive user guides and technical documentation to support future maintenance and updates.

oLeveraged Python libraries (PyPDF, PyMuPDF, PdfMiner) and NLP tools to perform efficient text extraction and preprocessing, ensuring the chatbot interacted reliably with policy data.

Project-6: Promotion Analysis and List Price Optimization

Business Objective: Analyze different types of promotions and their ROI, and estimate price elasticities for different items.

Responsibilities:

oAnalyzed historical promotions to provide executive-ready summaries on promotion ROI, incremental lift, and key insights on winning and losing promotion strategies.

oBuilt hypotheses to assess the ROI of different types of promotions (e.g., Loyalty, Always-on, Personalized Recommendations) at the store and regional level, leading to data-driven pricing strategies.

oLeveraged analytical techniques such as Customer Lifetime Value (CLTV) modeling and Market Basket Analysis to segment customers and optimize promotion strategies.

oCalculated price elasticities by analyzing historical price changes, identifying the most and least price-responsive items, and making recommendations for list price updates.

oPerformed extensive data analysis using SQL and Python, enabling in-depth insights into promotion effectiveness and pricing dynamics.

oLed a team of four within an Agile methodology, ensuring timely project delivery and collaboration across teams.

oMonitored and updated list prices, using price elasticity data to adjust pricing strategies and maximize promotional effectiveness.

oDemonstrated strong knowledge of promotions, including Loyalty, Always-on, and Personalized Recommendations, to inform strategic pricing decisions.
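
For illustration, the price elasticity step listed above can be sketched as a log-log regression, where the slope of ln(quantity) against ln(price) is the elasticity estimate. This is a minimal sketch under stated assumptions, not the project's actual code: the function name `price_elasticity` and the sample price history are hypothetical, and a real analysis would also control for promotions and seasonality.

```python
import math


def price_elasticity(prices, quantities):
    """Estimate own-price elasticity as the OLS slope of ln(quantity) on ln(price)."""
    if len(prices) != len(quantities) or len(prices) < 2:
        raise ValueError("need at least two matching price/quantity observations")
    log_p = [math.log(p) for p in prices]
    log_q = [math.log(q) for q in quantities]
    mean_p = sum(log_p) / len(log_p)
    mean_q = sum(log_q) / len(log_q)
    # Slope = covariance(ln p, ln q) / variance(ln p)
    cov = sum((x - mean_p) * (y - mean_q) for x, y in zip(log_p, log_q))
    var = sum((x - mean_p) ** 2 for x in log_p)
    if var == 0:
        raise ValueError("prices must vary to estimate an elasticity")
    return cov / var


# Hypothetical price history generated from quantity = 1000 * price ** -2,
# so the estimated elasticity should come out close to -2 (a price-responsive item).
history_prices = [1.0, 2.0, 4.0, 5.0]
history_quantities = [1000.0 * p ** -2 for p in history_prices]
elasticity = price_elasticity(history_prices, history_quantities)
```

Items whose estimate is far below -1 are the most price-responsive; items with estimates near 0 are the least.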

Client: USAA, San Antonio, Texas, India, Jan 2019 – Feb 2022

Title: Machine Learning Engineer

Project-5: Smart Drive Customer Risk Factor Detection and Premium Prediction

Business Objective: Build a customer segmentation model using telemetric data to predict driver risk levels and develop a regression model to accurately forecast premium prices.

Responsibilities:

oCollaborated with subject matter experts (SMEs) to finalize and collect essential telemetric data features for risk factor analysis and premium prediction.

oPrepared the base dataset by cleaning and removing features with over 50% null values, ensuring data quality for accurate model building.

oConducted Exploratory Data Analysis (EDA) to provide actionable insights on driver behavior and risk segmentation.

oBuilt a binary classification model using Random Forest Classifier to classify drivers into high-risk and low-risk categories based on telemetric data.

oValidated the classification model using performance metrics such as Confusion Matrix and Accuracy to ensure reliable risk factor predictions.

oPredicted premium values for policies by applying a risk factor analysis mechanism to customer segmentation data.

oDeveloped a regression model using Random Forest Regressor to accurately predict the premium price for each customer.

oOptimized model performance through hyperparameter tuning using GridSearchCV for both the classifier and regressor models.

oEvaluated the regression model using metrics such as R², Adjusted R², RMSE, MSE, and MAE, ensuring the model’s predictive accuracy and reliability.
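
For illustration, the regression metrics named above can be computed as follows. This is a minimal sketch using the standard textbook definitions, not the project's actual evaluation code: the function name `regression_metrics`, the feature count, and the sample premium values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R^2 and adjusted R^2 for a set of predictions."""
    n = len(y_true)
    if n != len(y_pred) or n - n_features - 1 <= 0:
        raise ValueError("need matching lengths and more samples than features + 1")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)    # total sum of squares
    if ss_tot == 0:
        raise ValueError("y_true must vary to compute R^2")
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
        "adj_r2": adj_r2,
    }


# Hypothetical premiums (actual vs. predicted) from a model with 2 features.
actual = [100.0, 200.0, 300.0, 400.0, 500.0]
predicted = [110.0, 190.0, 310.0, 390.0, 510.0]
metrics = regression_metrics(actual, predicted, n_features=2)
```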

Project-4: Human Emotion Responses using OpenCV, YOLO, and ResNet

Business Objective: Develop a system to analyze customer service videos, detect emotions, and enhance sentiment analysis by accurately identifying and categorizing customer emotions from video content.

Responsibilities:

oDeveloped and deployed an LSTM classification model using Keras Tokenizer to predict customer sentiment from product and service reviews.

oPre-processed customer reviews using lemmatization, stemming, and stop-word removal to clean and prepare text data for sentiment analysis.

oImplemented count vectorization and TF-IDF to create a sparse matrix, facilitating effective sentiment analysis and emotion classification.

oOptimized model performance using advanced techniques like Parameter-Efficient Fine-Tuning (PEFT) and LoRA fine-tuning, improving model adaptability.

oUtilized Databricks to build a scalable pipeline for processing both textual and video data, automating the ingestion, analysis, and storage of multimodal datasets for comprehensive sentiment analysis.

oBuilt an initial Naïve Bayes sentiment classification model, evaluating its performance with a confusion matrix and accuracy metrics.

oCreated word clouds to visualize key sentiments and identify patterns from frequently used keywords in customer reviews.

oTested the LSTM model over 30 days, continuously refining it before deploying the model to production.

oEvaluated the final model using a confusion matrix after rigorous testing, ensuring robustness and accuracy in production environments.

oMonitored and updated the model by incorporating new data and re-training to maintain the accuracy and relevance of sentiment analysis.

Client: ALDI India, Jan 2016 – Jan 2019

Title: Data Scientist

Project-3: Customer Emotion Detection through Review Analysis

Business Objective: Build a sentiment analysis system to track customer reviews on products and services, identifying and addressing unsatisfied customers effectively.

Responsibilities:

oPre-processed customer reviews using lemmatization, stemming, and stop-word removal for sentiment analysis.

oCreated word clouds to visualize key sentiments and patterns from frequently used keywords in customer reviews.

oImplemented count vectorization and TF-IDF to generate a sparse matrix for effective sentiment analysis and emotion classification.

oBuilt a Naïve Bayes sentiment classification model, evaluating performance with a confusion matrix and accuracy metrics.

oOptimized model performance using Parameter-Efficient Fine-Tuning (PEFT) and LoRA fine-tuning for better adaptability.

oDeveloped and deployed an LSTM classification model using Keras Tokenizer to predict customer sentiment from reviews.

oTested and refined the LSTM model over 30 days, validating accuracy before deploying it to production.

oEvaluated the final model’s robustness using a confusion matrix after a month of rigorous testing.

oMonitored and updated the model by incorporating new data and re-training to maintain accuracy in sentiment analysis.
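
For illustration, the TF-IDF step listed above can be sketched in plain Python. This is a minimal sketch of one common variant (term frequency times ln(N / document frequency)), not the project's actual code and not the exact weighting any particular library applies: the function name `tfidf` and the sample reviews are hypothetical, and tokenization is simplified to lowercase whitespace splitting.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (a sparse representation)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical customer reviews.
reviews = ["great product great price", "terrible product", "great service"]
review_weights = tfidf(reviews)
```

Only terms that occur in a review are stored for it, which is what keeps the resulting matrix sparse.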

Project-2: Customer Segmentation Using K-Means

Business Objective: Build a customer segmentation model to recommend optimal products to new customers based on existing customer data, resulting in a higher-than-expected conversion rate.

Responsibilities:

oPerformed data collection and requirement gathering with SMEs to ensure accurate data for customer segmentation.

oDesigned and implemented the K-Means clustering model, using the Elbow Method to determine optimal clusters.

oAnalyzed and visualized customer data to classify variables based on behavior and characteristics, identifying key segments.

oDeveloped customer segments for CRM and social channels (e.g., Facebook, Google Ads), resulting in increased revenue per customer.

oPresented segmentation results to stakeholders, providing insights to optimize product offerings for new customers.

oMonitored model performance, achieving higher-than-expected conversion rates through effective targeting.

Client: Siemens India, Nov 2014 – Jan 2016

Title: Jr. Data Scientist

Project-1: Circuit Breaker Performance Analysis and Sales Optimization

Business Objective: Analyze the performance of Siemens’ circuit breakers using time synchronization and frequency modulation analysis to enhance product accuracy, optimize performance, and develop strategies for sales growth, while providing actionable insights through data visualization.

Responsibilities:

oPerformed time synchronization and frequency modulation analysis to improve circuit breaker accuracy and optimize operations.

oDeveloped predictive models to forecast performance and enhance circuit breaker reliability.

oImplemented optimization algorithms to reduce operational time and boost product efficiency.

oAnalyzed performance metrics to identify inefficiencies and recommend targeted improvements.

oCreated data visualizations to present performance insights, enabling data-driven product enhancements.

oCollaborated with engineering teams to integrate insights into design, improving accuracy and efficiency.

oDeveloped dashboards and reports to monitor KPIs for performance and sales trends.

oAnalyzed sales data alongside performance metrics to identify strategies for increasing sales through product improvements.

oProvided time optimization insights aligned with Siemens’ goals for product efficiency and market competitiveness.

oSupported sales and marketing teams with data-driven recommendations to enhance product positioning and customer satisfaction.
