Data Scientist

Location: San Jose, CA
Posted: January 09, 2024


Arka Bagchi

*** ******* ******, *** ****, CA, ***39 • 408-***-**** • ad2lqh@r.postjobfree.com • https://www.linkedin.com/in/arka-bagchi-381a111b4/ • https://github.com/arkabag

PROFESSIONAL PROFILE

Data Scientist with 6 years of experience in data analysis, specializing in classical ML modeling and advanced NLP techniques in finance using large language model frameworks. Proficient in Python, SQL, Tableau, and a wide range of data science libraries, with a proven track record of extracting insights from complex financial and legal documents using LLM pipelines. Seeking a dynamic role in which to apply my expertise in NLP and ML, contributing to impactful, data-driven decisions.

TECHNICAL SKILLS

DBMS: Relational: SQL, MySQL, PostgreSQL, ClickHouse; NoSQL: MongoDB; Vector stores: PGVector, Chroma, Pinecone, Weaviate

Analytical Tools: SQL, Python, PySpark, Databricks, Tableau, Power BI, Google Analytics, Advanced Excel, SAS

Data Science: Data Wrangling, Statistical Modeling, Predictive Analytics, Data Visualization, NLP for Summarization Models, Recommender Systems, Time Series Analysis

Machine Learning: Retrieval Augmented Generation (RAG), LLMOps, LLM Frameworks, Random Forest, SVM, Recommender Systems, Naïve Bayes, kNN, Gradient Boosting, Customer Segmentation With Clustering, Prompt Engineering, Named-Entity Recognition (NER)

EDUCATION AND CERTIFICATION

Data Science Career Track, Springboard

500+ hours of mentor-led professional development with hands-on application in Python, SQL, Spark, machine learning, deep learning, inferential statistics, data visualization, and data storytelling

Bachelor of Science, Psychology (Emphasis on Statistics), Washington State University, Pullman, WA - 2020

Relevant Courses: Advanced Statistics with Calculus, Applied Data Science with Python, SQL for Data Science

Honors: Magna Cum Laude

RELEVANT EXPERIENCE

Data Scientist, MCI Private Equity, San Jose, CA Jan 2022 - Present

•Implemented RAG pipelines with the Python LLM frameworks LangChain and LlamaIndex and various vector databases, primarily Pinecone, Weaviate, and pgvector (a PostgreSQL extension)

•Built RAG pipelines end to end: loaded documents with LangChain and LlamaIndex document loaders, chunked them with semantic routing or other dynamic chunking models, stored the embedded chunks in a vector database, and queried those embeddings to retrieve context for prompt injection (sketched after this section)

•Delivered LLM-based text summarization that substantially improved the accuracy of text extraction from regulatory documents, contributing to a 30 basis point increase in IRR for the fund

•Implemented Python-based LLMOps pipelines with Weaviate specifically tailored for SEC filings and real estate data, significantly improving data granularity and analysis precision

•Operated and delivered RAG (LLM) pipelines through Weaviate and LangChain (Python) for extracting structured data and salient text from SMID-cap healthcare equity filings

•Advanced the use of PostgreSQL with the PGVector extension to run cosine similarity search over embedded chunks of textual data, enabling precise identification and injection of the most relevant chunks into LLM frameworks for in-depth analysis (a retrieval sketch follows this section)

•Optimized a gradient boosting model (LightGBM) in Python ML pipelines built with scikit-learn for Net Present Value (NPV) backtesting prediction, improving the model's F1 score by 5% and accuracy by 4%, and ensured robustness through careful AUC/ROC analysis (a minimal pipeline sketch also follows this section)

•A/B tested fine-tuning strategies for embedding models to determine the best vector representation of nuanced financial and legal text chunks
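
For illustration, a minimal end-to-end sketch of such a RAG pipeline, assuming LangChain-style loaders, splitters, and a Chroma vector store (import paths vary by LangChain version; the file path and query are placeholders, not actual project artifacts):

```python
# Minimal RAG ingestion-and-retrieval sketch; package layout assumes the
# langchain-community / langchain-openai split and may differ by version.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load source documents (SEC filings, regulatory PDFs, etc.); path is hypothetical.
docs = PyPDFLoader("filings/10k.pdf").load()

# 2. Chunk the documents; a recursive splitter stands in for semantic routing here.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks and store them in a vector database (Chroma for brevity;
#    Pinecone, Weaviate, or PGVector plug in the same way).
store = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 4. Retrieve the most similar chunks for a query and assemble them for prompt injection.
top_k = store.similarity_search("What are the key regulatory risks?", k=4)
context = "\n\n".join(d.page_content for d in top_k)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: ..."
```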
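A sketch of the PGVector cosine-similarity retrieval step, assuming a chunks table with an embedding vector column and psycopg2 for database access (the table, column, and connection details are hypothetical):

```python
import psycopg2

def top_k_chunks(query_vec, k=5):
    """Return the k stored chunks closest to query_vec by cosine distance (pgvector's <=> operator)."""
    # pgvector accepts a bracketed text literal cast to the vector type.
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    conn = psycopg2.connect("dbname=filings user=analyst")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS cosine_distance
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, vec_literal, k),
        )
        return cur.fetchall()
```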
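And a sketch of the kind of scikit-learn / LightGBM pipeline and F1 / ROC AUC evaluation described above (the dataset, features, and hyperparameters are placeholders, not the production model):

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical numeric backtest dataset; "label" marks deals that cleared the NPV hurdle.
df = pd.read_csv("npv_backtests.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Pipeline([
    ("scale", StandardScaler()),
    ("gbm", LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("F1:", f1_score(y_test, pred), "ROC AUC:", roc_auc_score(y_test, proba))
```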

Data Analyst, Kraw Law Group, Mountain View, CA Dec 2020 - Dec 2021

•Developed and managed text classification pipelines for trust fund legal documents

•Implemented a bespoke Python NLP pipeline integrating early-stage BERT models for automatic classification and summarization of legal documents, enhancing the efficiency of ERISA fund filings and contributing to a 30% increase in document processing speed.

•Spearheaded the Named Entity Recognition (NER) project for legal documents, utilizing state-of-the-art transformer models like RoBERTa and fine-tuning them for the legal domain, achieving 85% accuracy in entity extraction and significantly aiding compliance tracking (see the sketch below)
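
A minimal sketch of transformer-based NER of this kind, using the Hugging Face pipeline API with an illustrative public checkpoint rather than the fine-tuned legal-domain model referenced above:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",      # example public NER checkpoint (assumption)
    aggregation_strategy="simple",    # merge word pieces into whole entities
)

text = "The trustee of the Acme Pension Fund filed the annual ERISA report in California."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```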

Patent Database Manager, Office of Commercialization, Spokane, WA July 2017 - Dec 2020

•Managed patent prosecution database on Inteum

•Developed pipeline from local patent portfolio to Named Entity Recognition (NER) models

Research Analyst, WSU Cognition Lab, Pullman, WA July 2017 - Dec 2020

•Conducted data analysis using IBM SPSS, focusing on clinical research datasets.

•Engaged in data preparation and mining for postgraduate research, utilizing IBM SPSS Modeler for complex queries.

FEATURED GITHUB PROJECT SHOWCASE

Airbnb Review Summarizer for Investors (Nov 2023 - Present): Developed a RAG pipeline for investment analysis of short-term rentals in Joshua Tree National Park. Techniques include processing Airbnb review data in PostgreSQL, similarity retrieval with PGVector, and LLM-based summarization and real estate analysis.

LLM Frameworks for SEC Filing Equity Analysis (Nov 2023 - Present): Created NLP pipelines for QA, fact extraction, and analysis over biotech/healthcare SEC filings, tracking key financial metrics and developing Python-based analytical tools. Built the full RAG pipeline: loading the data, chunking, embedding the chunks, top-k retrieval via similarity search over a vector database, injecting the top-k chunks into an LLM prompt, and final document QA with the generative LLM.
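
A sketch of the final prompt-injection and document-QA step, assuming the OpenAI chat completions client and that the top-k chunks have already been retrieved upstream (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_chunks(question, top_k_chunks, model="gpt-4o-mini"):
    """Inject retrieved filing excerpts into the prompt and ask the LLM to answer from them only."""
    context = "\n\n".join(top_k_chunks)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer strictly from the provided filing excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```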

Synthetic Journal Data Generator (Nov 2023 - Present): Implemented LangChain and LlamaIndex for generating synthetic journal data. Utilized Pydantic for output parsing and JSON schema querying in AI models. GitHub
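
A sketch of Pydantic-based output parsing of this sort, assuming the LLM is prompted to return JSON matching the schema (Pydantic v2; the field names are illustrative, not the project's actual schema):

```python
from datetime import date
from pydantic import BaseModel

class JournalEntry(BaseModel):
    entry_date: date
    mood: str
    text: str

# Hypothetical raw LLM output; validation raises ValidationError on malformed JSON.
llm_output = '{"entry_date": "2023-11-02", "mood": "optimistic", "text": "Toured two rentals today."}'
entry = JournalEntry.model_validate_json(llm_output)
print(entry.entry_date, entry.mood)
```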

Airbnb Occupancy Modeling (Sep 2023 - Dec 2023): Executed a machine learning project for predicting Airbnb occupancy rates using Python and PostgreSQL, focusing on data from Joshua Tree National Park. GitHub


