Younes Karimi
Email: *.******.***@*****.***
LinkedIn: YounesKarimi
Location: Washington, D.C.
Website: younes-karimi.github.io
EDUCATION
• Penn State University, College of Information Sciences & Technology (GPA: 3.96), Pennsylvania, USA
Ph.D. in Informatics; Areas: Data Science, Natural Language Processing, Privacy, Cybersecurity; August 2024
• State University of New York at Albany, New York, USA
M.Sc. in Computer Science; Areas: Machine Learning, Natural Language Processing, Privacy, Security; August 2020
• University of Tehran, Tehran, Iran
B.Sc. in Electrical Engineering; Area: Telecommunications; August 2017
TECHNICAL SKILLS
• Languages: Python (expert/most-used), SQL, Java, Bash, Ruby, JavaScript, C, C++, MATLAB, HTML, CSS, XML
• Machine Learning, Deep Learning and Data Science: Scikit-learn, PyTorch, TensorFlow, Keras, Hugging Face, PySpark, NumPy, SciPy, Pandas, Multithreading, Multiprocessing, Airflow, Matplotlib, D3.js, Jupyter, NetworkX, BentoML, CTranslate2
• NLP & Computer Vision: Transformers, RAG, FAISS, Elasticsearch, Qdrant, LangChain, LlamaIndex, Chroma, OpenAI GPT, Meta Llama, Google Gemini, DSPy, T5, BERT, DeepL, MADLAD, CoEdIT, GTE, BGE-M3, Multilingual-E5, FLAIR, PAWS, MPNet, MiniLM, fastText, GloVe, Word2Vec, NLTK, Gensim, spaCy, TextBlob, CoreNLP, WordNet, SimpleNLG, OCR, BeautifulSoup, Torchvision, ResNet
• Software Engineering: AWS EC2/RDS/S3/CloudWatch/DynamoDB/MediaConnect/SQS, GCP, MySQL, PostgreSQL, NoSQL, Redis, MongoDB, AsyncIO, Kafka, Flask, Ruby on Rails, Spring Boot, Thymeleaf, MVC design, Agile, Nginx
• Audio Processing: Whisper, Demucs, SpeechT5, ElevenLabs, PlayHT, PyAnnote, Resemblyzer, SpeechBrain, Pydub, Stable-ts, PyRubberband, AssemblyAI, FFmpeg, Deepgram, Prosody, SSML, Librosa, Pyloudnorm
• Software Testing: Selenium, Appium, Android Debug Bridge (ADB), Monkey, Monkey Runner, Autopsy, Ghidra, GitHub Actions
• Collaboration & OS: Kubernetes (K8s), Docker, GitHub, GitLab, Jira, Confluence, Linux
RESEARCH & WORK EXPERIENCE
• Lingopal (Senior Machine Learning Engineer) New York City, May 2024–present
• Lingopal (Machine Learning Engineer — Intern) New York City, Jan–May 2024
NLP & GenAI: Developed & evaluated SOTA paraphrasing, summarization, emotion extraction, translation & language detection systems using various transformer/GPT models & performed prompt engineering & Chain-of-Thought prompting using DSPy.
ML Roadmap: Created a comprehensive roadmap for fine-tuning and training LLMs, ASR and audio processing models.
Gender Classifier: Designed and built a gender classification model based on audio embeddings with 98% accuracy and F1-score.
Multilingual Paraphrasing: Developed evaluation metrics and dynamic thresholds for paraphrasing based on the speakers’ speed.
Dynamic Audio Chunking: Designed and developed a dynamic approach for splitting audio streams for precise real-time processing.
Multilingual Transcription: Designed, implemented & evaluated an approach to accurately transcribe multilingual audio segments.
Audio Processing: Developed SOTA audio embedding and speaker identification, voice cloning, sentiment classification, transcription, text-to-speech with emotions, audio placement, audio segmentation, vocal separation, and audio denoising systems.
• Privacy in Social Computing & Human Language Technologies Labs (Research Assistant) Penn State University, May 2020–Jul 2024
Analysis of Extremist & Terrorist Groups on Twitter (Thesis Research, Partially funded by NSF):
* Collected and analyzed a longitudinal dataset of 10 million tweets & metadata from likely ISIS affiliates using Twitter/X API.
* Collected and investigated over 55 million tweets and user networks regarding the WomenLifeFreedom uprising in Iran.
* Fine-tuned LLMs such as XLM-RoBERTa & ParsBERT on the Iranian corpus and developed user classifiers for both projects.
* Developed ensemble user classifiers for the Iranian dataset using individual user attributes, social networks, and tweets.
* Performed image and text classification on ISIS photos and texts extracted from them using OCR.
* Designed a large-scale label bootstrapping schema using active learning, weakly-supervised & semi-supervised learning.
* Developed methods and models for detection of propaganda messages & techniques used by ISIS and the Iranian regime.
Cyberbullying Defense & Privacy in Online Social Networks:
* Collected and curated a corpus of likely doxing tweets using Twitter/X API
* Built an automated detector of sensitive information disclosures on Twitter using FLAIR contextualized string embeddings
* Presented comprehensive insights into the potential intentions behind sensitive information disclosures
* Designed and performed online crowdsourcing using Amazon Mechanical Turk
Preparation and Analyses of PrivaSeer and PrivaSeerQA Datasets (Funded by NSF):
* Compiled a corpus of 4 million privacy policies with enhanced metadata and organization for public release
* Formalized and organized a corpus of 1.3 million question-answer pairs extracted from one million privacy policies
* Compared privacy policies at scale and identified common patterns and practices among distinct websites and industries
* Built context-aware policy information retrieval chatbots using Voiceflow, ChatGPT & Retrieval-Augmented Generation
• Albany Lab for Privacy & Security (Research Assistant) SUNY at Albany, August 2017–May 2020
PANACEA (Personalized AutoNomous Agents Countering Social Engineering Attacks, funded by DARPA ASED):
* Designed and developed web services using Python Flask for named entity recognition (NER) and for detection of phishing and spear-phishing in emails
* Performed user profiling, anomalous behavior detection & authorship verification for senders, recipients & P2P channels
* Developed an impersonation detection component for email senders using Ruby on Rails, PostgreSQL, Redis & Rails ERD
* Developed a stacking ensemble meta-learner for integrating PANACEA base classifiers’ decisions using Python & Ruby
User Review Analysis of Mobile Apps and Amazon Products:
* Implemented unsupervised approaches for aspect extraction and automated identification of privacy- and security-related concerns of users raised in Android app reviews on Google Play Store and product reviews on Amazon
* Implemented text classifications using Word2Vec & GloVe word embeddings, WordNet, N-Grams, BoW, and Bag of Word Clusters
* Designed and configured an online review annotation environment using Brat annotation tool
* Implemented various text cleansing, pre-processing, normalization, stemming, and lemmatization approaches
Inferring Privacy Settings and Privacy Policy Compliance of Android Applications:
* Implemented user simulator clients for testing Android apps using Appium, Selenium, Monkey, Monkey Runner, and Droidbot
* Modified the Android Open Source Project and built a custom Android ROM
* Dumped, analyzed, and inspected several Android applications’ layouts and GUIs using ADB, Dumpsys, XPath, UIAutomator, Android Studio Hierarchy Viewer, and Layout Inspector
Inference & Conformance Testing of Access Control Models:
* Reviewed ReBAC, ABAC, RBAC, and OrBAC access control models, their conformance testing, test suite generation, and use of state machines, Petri nets, and PrT nets for model illustration & test modeling
* Conducted research and experiments to infer/create access control models for the applications by detecting the sensitive resources & user data in the HTTP traces of open source web applications (WordPress, Elgg, and Funkwhale)
PAPERS, POSTERS & PEER-REVIEWS
• Anonymous paper, submitted to EMNLP, 2024
• A Longitudinal Dataset & Analysis of Twitter ISIS Users and Propaganda, Springer–Nature SNAM journal, 2024.
• A Longitudinal Dataset of Twitter ISIS Users, arXiv, 2022
• COVID-19 and Haters — A User Model Perspective. IEEE DSAA, 2022
• Automated Detection of Doxing on Twitter. ACM CSCW, 2022
• Modeling Longitudinal Behavior Dynamics Among Extremist Users in Twitter Data. IEEE BigData, 2021
• Active Defense Against Social Engineering: The Case for Human Language Technology. LREC, 2020
• Modeling Social Engineering Risk Using Attitudes, Actions, & Intentions Reflected in Language Use. FLAIRS-32, 2019
• PapiaPass: Sentence-based Passwords using Dependency Trees. ISC Conf. on Information Security & Cryptology, 2016
• Deniable Encryption based on Standard RSA with OAEP. International Symposium on Telecommunications, 2016
• Peer-Review: ACL, NAACL, ACL-IJCNLP, ACL ARR, EMNLP, ACM CODASPY, IEEE CIC, IEEE IRI, IEEE BigData
EXTRACURRICULAR
• Attended the 7th annual CyberTruck Challenge & assessed a novel generation of Bosch CAN gateway devices 2023
• Student volunteer at the 15th ACM International Conference on Web Search and Data Mining (WSDM’22) 2022
• Attended the 3rd annual CyberTruck Challenge & performed penetration testing on a novel generation of Omnitracs ELDs 2019
• Graduate mentor for young computer science researchers of the Girls Inc. Eureka! program at UAlbany 2019
• Active member of the Student Branch of Iran Society of Cryptology at the University of Tehran 2015–2017
• Vice-chairman of IEEE University of Tehran Student Branch 2014–2015
TEACHING EXPERIENCE
• Penn State University: Overview of Information Security, Malware Analytics, Cybersecurity Analytics and Operations Capstone, Security Analytics Studio
• SUNY at Albany: Systems Programming, Computer Organization & Architecture, GPU Architecture & Programming
Updated: October 14, 2024