Machine Learning Engineer, Data Science, NLP Engineer, AI Engineer

Location: Arlington, VA

Posted: November 13, 2024

Resume:

Younes Karimi

Email: *.******.***@*****.***

LinkedIn: YounesKarimi

Location: Washington, D.C.

Website: younes-karimi.github.io

EDUCATION

• Penn State University, College of Information Sciences & Technology (GPA: 3.96), Pennsylvania, USA. Ph.D. in Informatics. Areas: Data Science, Natural Language Processing, Privacy, Cybersecurity. August 2024

• State University of New York at Albany, New York, USA. M.Sc. in Computer Science. Areas: Machine Learning, Natural Language Processing, Privacy, Security. August 2020

• University of Tehran, Tehran, Iran. B.Sc. in Electrical Engineering. Area: Telecommunications. August 2017

TECHNICAL SKILLS

• Languages: Python (expert/most-used), SQL, Java, Bash, Ruby, JavaScript, C, C++, MATLAB, HTML, CSS, XML

• Machine Learning, Deep Learning and Data Science: Scikit-learn, PyTorch, TensorFlow, Keras, Hugging Face, PySpark, NumPy, SciPy, Pandas, Multithreading, Multiprocessing, Airflow, Matplotlib, D3.js, Jupyter, NetworkX, BentoML, CTranslate2

• NLP & Computer Vision: Transformers, RAG, FAISS, Elasticsearch, Qdrant, LangChain, LlamaIndex, Chroma, OpenAI GPT, Meta Llama, Google Gemini, DSPy, T5, BERT, DeepL, MADLAD, CoEdIT, GTE, BGE-M3, Multilingual-E5, FLAIR, PAWS, MPNet, MiniLM, fastText, GloVe, Word2Vec, NLTK, Gensim, spaCy, TextBlob, CoreNLP, WordNet, SimpleNLG, OCR, BeautifulSoup, Torchvision, ResNet

• Software Engineering: AWS EC2/RDS/S3/CloudWatch/DynamoDB/MediaConnect/SQS, GCP, MySQL, PostgreSQL, NoSQL, Redis, MongoDB, AsyncIO, Kafka, Flask, Ruby on Rails, Spring Boot, Thymeleaf, MVC design, Agile, Nginx

• Audio Processing: Whisper, Demucs, SpeechT5, ElevenLabs, PlayHT, PyAnnote, Resemblyzer, SpeechBrain, Pydub, Stable-ts, PyRubberband, AssemblyAI, FFmpeg, Deepgram, Prosody, SSML, Librosa, Pyloudnorm

• Software Testing: Selenium, Appium, Android Debug Bridge (ADB), Monkey, Monkey Runner, Autopsy, Ghidra, GitHub Actions

• Collaboration & OS: Kubernetes (K8s), Docker, GitHub, GitLab, Jira, Confluence, Linux

RESEARCH & WORK EXPERIENCE

• Lingopal (Senior Machine Learning Engineer) New York City, May 2024–present

• Lingopal (Machine Learning Engineer — Intern) New York City, Jan–May 2024

NLP & GenAI: Developed and evaluated state-of-the-art paraphrasing, summarization, emotion extraction, translation, and language detection systems using various transformer and GPT models; performed prompt engineering and Chain-of-Thought prompting using DSPy.

ML Roadmap: Created a comprehensive roadmap for fine-tuning and training LLMs, ASR and audio processing models.

Gender Classifier: Designed and built a gender classification model based on audio embeddings with 98% accuracy and F1-score.

Multilingual Paraphrasing: Developed evaluation metrics and dynamic thresholds for paraphrasing based on the speakers’ speed.

Dynamic Audio Chunking: Designed and developed a dynamic approach for splitting audio streams for precise real-time processing.

Multilingual Transcription: Designed, implemented & evaluated an approach to accurately transcribe multilingual audio segments

Audio Processing: Developed SOTA audio embedding and speaker identification, voice cloning, sentiment classification, transcription, text-to-speech with emotions, audio placement, audio segmentation, vocal separation, and audio denoising systems.

• Privacy in Social Computing & Human Language Technologies Labs (Research Assistant) Penn State University, May 2020–Jul 2024

Analysis of Extremist & Terrorist Groups on Twitter (Thesis Research, Partially funded by NSF):

* Collected and analyzed a longitudinal dataset of 10 million tweets & metadata from likely ISIS affiliates using Twitter/X API.

* Collected and investigated over 55 million tweets and user networks regarding the WomenLifeFreedom uprising in Iran.

* Fine-tuned LLMs such as XLM-RoBERTa & ParsBERT on the Iranian corpus and developed user classifiers for both projects.

* Developed ensemble user classifiers for the Iranian dataset using individual user attributes, social networks, and tweets.

* Performed image classification on ISIS photos and text classification on the text extracted from them using OCR.

* Designed a large-scale label bootstrapping schema using active learning, weakly-supervised & semi-supervised learning.

* Developed methods and models for detection of propaganda messages & techniques used by ISIS and the Iranian regime.

Cyberbullying Defense & Privacy in Online Social Networks:

* Collected and curated a corpus of likely doxing tweets using Twitter/X API

* Built an automated detector of sensitive information disclosures on Twitter using FLAIR contextualized string embeddings

* Presented a comprehensive analysis of the potential intentions behind sensitive information disclosures

* Designed and performed online crowdsourcing using Amazon Mechanical Turk

Preparation and Analyses of PrivaSeer and PrivaSeerQA Datasets (Funded by NSF):

* Compiled a corpus of 4 million privacy policies with enhanced metadata and organization for public release

* Formalized and organized a corpus of 1.3 million question-answer pairs extracted from one million privacy policies

* Compared privacy policies at scale and identified common patterns and practices among distinct websites and industries

* Built context-aware chatbots for privacy policy information retrieval using Voiceflow, ChatGPT & Retrieval-Augmented Generation

• Albany Lab for Privacy & Security (Research Assistant) SUNY at Albany, August 2017–May 2020

PANACEA (Personalized AutoNomous Agents Countering Social Engineering Attacks, funded by DARPA ASED):

* Designed and developed web services using Python Flask for NER and for detection of phishing and spear-phishing in emails

* Performed user profiling, anomalous behavior detection & authorship verification for senders, recipients & P2P channels

* Developed an impersonation detection component for email senders using Ruby on Rails, PostgreSQL, Redis & Rails ERD

* Developed a stacking ensemble meta-learner for integrating PANACEA base classifiers’ decisions using Python & Ruby

User Review Analysis of Mobile Apps and Amazon Products:

* Implemented unsupervised approaches for aspect extraction and automated identification of privacy- and security-related concerns of users raised in Android app reviews on Google Play Store and product reviews on Amazon

* Implemented text classifiers using Word2Vec & GloVe word embeddings, WordNet, N-Grams, BoW, and Bag of Word Clusters

* Designed and configured an online review annotation environment using Brat annotation tool

* Implemented various text cleansing, pre-processing, normalization, stemming, and lemmatization approaches

Inferring Privacy Settings and Privacy Policy Compliance of Android Applications:

* Implemented user simulator clients for testing Android apps using Appium, Selenium, Monkey, Monkey Runner, and Droidbot

* Modified the Android Open Source Project and built a custom Android ROM

* Dumped, analyzed, and inspected several Android applications’ layouts and GUIs using ADB, Dumpsys, XPath, UIAutomator, Android Studio Hierarchy Viewer, and Layout Inspector

Inference & Conformance Testing of Access Control Models:

* Reviewed ReBAC, ABAC, RBAC, and OrBAC access control models, their conformance testing, test suite generation, and use of state machines, Petri nets, and PrT nets for model illustration & test modeling

* Conducted research and experiments to infer access control models for open-source web applications (WordPress, Elgg, and Funkwhale) by detecting sensitive resources & user data in their HTTP traces

PAPERS, POSTERS & PEER-REVIEWS

• Anonymous paper, submitted to EMNLP, 2024

• A Longitudinal Dataset & Analysis of Twitter ISIS Users and Propaganda, Springer–Nature SNAM journal, 2024.

• A Longitudinal Dataset of Twitter ISIS Users, arXiv, 2022

• COVID-19 and Haters — A User Model Perspective. IEEE DSAA, 2022

• Automated Detection of Doxing on Twitter. ACM CSCW, 2022

• Modeling Longitudinal Behavior Dynamics Among Extremist Users in Twitter Data. IEEE BigData, 2021

• Active Defense Against Social Engineering: The Case for Human Language Technology. LREC, 2020

• Modeling Social Engineering Risk Using Attitudes, Actions, & Intentions Reflected in Language Use. FLAIRS-32, 2019

• PapiaPass: Sentence-based Passwords using Dependency Trees. ISC Conf. on Information Security & Cryptology, 2016

• Deniable Encryption based on Standard RSA with OAEP. International Symposium on Telecommunications, 2016

• Peer-Review: ACL, NAACL, ACL-IJCNLP, ACL ARR, EMNLP, ACM CODASPY, IEEE CIC, IEEE IRI, IEEE BigData

EXTRACURRICULAR

• Attended the 7th annual CyberTruck Challenge & assessed a novel generation of Bosch CAN gateway devices, 2023

• Student volunteer at the 15th ACM International Conference on Web Search and Data Mining (WSDM’22), 2022

• Attended the 3rd annual CyberTruck Challenge & performed penetration testing on a novel generation of Omnitracs ELDs, 2019

• Graduate mentor for young computer science researchers of the Girls Inc. Eureka! program at UAlbany, 2019

• Active member of the Student Branch of Iran Society of Cryptology at the University of Tehran, 2015–2017

• Vice-chairman of IEEE University of Tehran Student Branch, 2014–2015

TEACHING EXPERIENCE

• Penn State University: Overview of Information Security, Malware Analytics, Cybersecurity Analytics and Operations Capstone, Security Analytics Studio

• SUNY at Albany: Systems Programming, Computer Organization & Architecture, GPU Architecture & Programming

Update: October 14, 2024
