Arslan Erden
San Francisco Bay Area 530-***-**** ******.***@*****.***
SUMMARY
I am a fifth-year PhD candidate in Statistics at Florida State University, expecting to graduate by Spring 2023. I have five years of experience answering research questions by analyzing both structured and unstructured data with techniques such as Machine Learning, Deep Learning, Statistical Analysis, and Natural Language Processing. My research interests focus on Information Extraction, Information Retrieval, and Text Mining.
PROFESSIONAL EXPERIENCE
CARTA HEALTHCARE, Clinical Data Science Intern June 2022 – August 2022
- Added features to the existing NLP pipeline (OCR feature with Tika, updated flags, etc.)
- Built and released new datasets (MTSample dataset, NLPEvent dataset, OCREvent dataset, etc.)
- Implemented NLP pipeline profiling analysis and improved the speed by approximately 15%
FLORIDA STATE UNIVERSITY, Research Assistant August 2020 – May 2022
- Analyzed OneFlorida Alzheimer’s Disease (EHR) data (13 tables, each table with more than 1M rows)
- Trained and experimented with different Transformer-based NER models on clinical trial eligibility criteria
- Scraped and annotated PrEP/Truvada (HIV) related tweets; applied topic modeling and UMLS semantic type analysis to understand public discussions and topics of interest about HIV prevention on social media
- Implemented descriptive, association, and clustering analysis on COVID-19 clinical studies in ClinicalTrials.gov
COMPETITIONS, PROJECTS and RESEARCH
BioCreative VII Challenge Track 2 hosted by NIH: First Place
- Extracted chemical mentions in PubMed full-text articles with the third best F1 score among all the competitors
- Achieved the second best F1 score on chemical normalization and the best F1 score on chemical indexing
Litcoin NLP Challenge hosted by NASA Tournament Lab: First Place
- Identified biomedical concept mentions within a research paper with the highest Jaccard score among all the competitors
- Extracted all the relationships between biomedical concepts with the highest Jaccard score among all the competitors
Disease Risk Factor Extraction from Medline Database
- Curated and manually annotated a sentence-level corpus from the PubMed database
- Built and fine-tuned a pre-trained BERT (PyTorch) model for text classification
- Extracted risk factors for diseases at large scale
Transformer-Based Named Entity Recognition for Parsing Clinical Trial Eligibility Criteria (Accepted by ACM-BCB)
- Evaluated 4 state-of-the-art Transformer-based NER models and their variants on 2 publicly available clinical trial eligibility criteria datasets. The RoBERTa-MIMIC-Trial model outperformed the other models with an F1 score of 0.9157.
- The findings support the development of a reliable NLP-assisted pipeline for automated electronic screening.
EDUCATION
FLORIDA STATE UNIVERSITY Tallahassee, FL, USA
Doctor of Philosophy in Statistics Expected May 2023
Master of Science in Statistics August 2018 – May 2020
BEIJING INSTITUTE OF TECHNOLOGY Beijing, China
Bachelor of Aerospace Engineering September 2007 – June 2011
SKILLS
Familiar with programming languages: Python, SQL, R, and MATLAB
Familiar with tools: Jupyter Notebooks, Linux, Docker, GitHub, Bitbucket
Have some experience with programming languages: JavaScript, PHP, HTML, CSS, Ajax
Strong understanding of computer science fundamentals: algorithms and data structures
Passionate about NLP, Text Mining, and Information Retrieval, and familiar with: BERT, Hugging Face, spaCy, NLTK, StanfordNLP, PubTator, UMLS, QuickUMLS, Text Annotation
Familiar with Machine Learning and Deep Learning tools: PyTorch, TensorFlow, Keras, scikit-learn, NumPy, pandas
Have been exposed to tools: DVC, AWS S3 (MinIO), Jenkins, REST API, Elasticsearch, Tableau, Postgres