Data Scientist Machine Learning

Location:

Quan Thu Duc, Vietnam

Posted:

August 24, 2025


Resume:

Khang Nguyen — Fresher Data Scientist

Thu Duc City, Ho Chi Minh City – Viet Nam

094******* • ****************@*****.*** • khang3004 • KhangDS

Biography

I am a recent Data Science graduate with a strong background in Mathematics, Statistics, Optimization, and Machine Learning. I build end-to-end data/ML pipelines grounded in rigorous math and practical engineering. My internship at Vietnam Silicon (May–Aug 2025) gave me hands-on experience in HR Intelligence and Agentic AI.

Experience

Vietnam Silicon Company Ho Chi Minh City

Data Scientist Intern May 2025–Aug 2025

Project: Talent Market Intelligence System (team size: 4 members)

Problem: Traditional recruitment was time-consuming and inaccurate: HR spent 80% of its time screening CVs manually, with no objective evaluation against the job description (JD), no standardized scoring, no automated CV-JD matching, and fragmented candidate data. Goal: cut screening time by 70% and improve accuracy.

Solution:

Architecture: Microservices with API Gateway, Parsing, Extraction, and Benchmark services; async task distribution via RabbitMQ; Docker/Compose for scalable deployment.

AI/ML Pipeline: CV/JD parsing with Google Gemini + prompt engineering; FAISS + Sentence Transformers for semantic search; pgvector with HNSW for fast similarity search; multi-criteria scoring (Skills, Experience, Education, Language, Certificates).

Scoring System: Config-driven JSON rules; weighted algorithm (Skills 30%, Experience 30%, Education 15%, Language 15%, Certificates 10%); cosine similarity for skills matching; requirement-based logic for JD compliance.

Security: Enterprise-ready with Azure AD SSO, JWT, role-based authorization, and secure file storage on S3.

Results:

Reduced screening time by 75% (20 min/CV to 5 min/CV); batch-processed 100+ CVs in parallel.

Achieved 92% accuracy in skills matching, with a 3 s response time for the full pipeline.

Delivered multi-dimensional scoring (5 criteria, 100 pts each) with top-K ranking.

Ensured 99.5% uptime with health checks, optimized DB indexes, and robust error handling.

Tech Stack: FastAPI, Python 3.11, SQLAlchemy, Pydantic, PostgreSQL + pgvector, Redis, AWS S3, FAISS, Sentence Transformers, spaCy, PyTorch, RabbitMQ, Docker/Compose, Azure AD, JWT.

Extended Flow: ChatTMI Conversational Agent

Challenge: HR professionals still lacked a natural interface to interact with complex HR data. There was no AI assistant capable of reasoning, multi-turn memory, or real-time streaming for smooth conversations.

Approach:

MCP-Based Agent: Built on Model Context Protocol; implemented 50+ MCP Prompts, Resources, Tools for HR operations; integrated with LangChain MCP Adapters; applied ReAct Agent pattern orchestrated with LangGraph.

Advanced AI: Leveraged Gemini 2.5 Flash as the core LLM with intelligent routing; memory-enabled conversations with MemorySaver; 24 Gemini API keys with automatic rotation for high availability; real-time streaming with SSE; 8+ customizable system prompt personalities.

Data Intelligence: Autonomous schema learning; semantic relationship discovery; recursive JSON parsing for complex structures; anomaly detection and data profiling.

Modern Conversational UI: Glass-morphism responsive PWA; progressive markdown rendering; interactive quick actions; tables and charts embedded directly in chat responses.

Impact:

Fully automated HR database queries, eliminating manual SQL interactions.

Achieved sub-2s latency for real-time conversations with persistent multi-thread memory.

Delivered 50+ MCP tools, 24-API key failover rotation, 8+ conversation personalities, and 25+ production-ready endpoints.

Enhanced user experience with 60fps glass-morphism animations, mobile-first PWA design, and progressive markdown rendering.
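
The weighted scoring rule listed under Scoring System for the Talent Market Intelligence System (Skills 30%, Experience 30%, Education 15%, Language 15%, Certificates 10%, each criterion scored out of 100) can be illustrated with a minimal sketch. The function names, dictionary keys, and any sample scores are illustrative assumptions, not the production code.

```python
from typing import Dict, List, Mapping, Tuple

# Weights as listed in the resume; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "skills": 0.30,
    "experience": 0.30,
    "education": 0.15,
    "language": 0.15,
    "certificates": 0.10,
}


def weighted_score(scores: Mapping[str, float],
                   weights: Mapping[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (0-100 each) into one 0-100 total."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in weights:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(weights[name] * scores[name] for name in weights)


def top_k(candidates: Mapping[str, Mapping[str, float]],
          k: int) -> List[Tuple[str, float]]:
    """Rank candidates by weighted total, highest first, and keep k."""
    ranked = sorted(
        ((name, weighted_score(s)) for name, s in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

For example, criterion scores of 80 (skills), 70 (experience), 90 (education), 60 (language), and 50 (certificates) combine to 24 + 21 + 13.5 + 9 + 5 = 72.5 out of 100.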

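The automatic rotation across 24 Gemini API keys mentioned for ChatTMI is not described in detail here. One plausible way to get failover behavior is round-robin selection with a cooldown for keys that just failed. The sketch below assumes that policy; the class name `KeyRotator`, its methods, and the 60-second cooldown are hypothetical and chosen only for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Sequence


class KeyRotator:
    """Round-robin over API keys, skipping keys that recently failed."""

    def __init__(self, keys: Sequence[str],
                 cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not keys:
            raise ValueError("at least one key is required")
        self._keys: Deque[str] = deque(keys)
        self._cooldown = cooldown_seconds
        self._blocked_until: Dict[str, float] = {}
        self._clock = clock

    def next_key(self) -> str:
        """Return the next usable key, or raise if all are cooling down."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = self._keys[0]
            self._keys.rotate(-1)  # move the inspected key to the back
            if self._blocked_until.get(key, float("-inf")) <= now:
                return key
        raise RuntimeError("all keys are cooling down")

    def report_failure(self, key: str) -> None:
        """Mark a key as unusable until the cooldown has elapsed."""
        self._blocked_until[key] = self._clock() + self._cooldown
```

A caller would request `next_key()` before each LLM call and invoke `report_failure(key)` when a call is rejected (for instance on a rate-limit response), so later requests skip that key until its cooldown ends.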

Education

HUTECH University Ho Chi Minh City

Bachelor in Data Science, GPA: 3.7/4.0 2021–2025

Class President; Bronze Medalist at Vietnam Mathematical Olympiad 2023–2024

AI Vietnam Remote

AIO2023 Program 2023–2024

Intensive program on advanced AI/ML & Data Science.

Ly Tu Trong High School for the Gifted Can Tho City

Mathematics (Gifted Program), GPA: 9.0/10.0 2018–2021

Encouragement Prize at Vietnam Mathematical Olympiad 2020

Selected Personal Projects

Booking.com Hotel Analytics (GitHub)

Problem: Predict hotel review scores, segment the market, and classify quality tiers using multi-modal data (text/image/metadata).

Solution: Designed a combined pipeline: ResNet18 image features fused with tabular inputs for regression; unsupervised K-Means/DBSCAN for segmentation; multi-class quality classification with a stacking ensemble (SVM, KNN, Decision Tree, Random Forest, Logistic Regression). Included Docker scripts for reproducible tasks and a directory structure for raw/processed data, models, and results.

Results: Regression: RMSE = 0.85, R² = 0.78, MAE = 0.67. Classification: Accuracy = 0.84, F1 = 0.82, ROC-AUC = 0.89. Clustering: Silhouette = 0.76 with 3 optimal clusters.

Tech Stack: Python, PyTorch, Scikit-learn, Pandas, Seaborn/Matplotlib, Docker.

Vietnamese Financial Sentiment Analysis (GitHub)

Problem: Classify sentiment of Vietnamese stock-market headlines and link to market movement signals.

Solution: Full NLP pipeline with Vietnamese tokenization (VnCoreNLP), feature engineering, and hybrid ML/DL modeling (GaussianNB/LogReg/RandomForest/XGBoost and LSTM/BiLSTM/PhoBERT). Managed data and predictions with MongoDB; provided environment/Docker specs and an out-of-sample evaluation protocol.

Results: Best model (PhoBERT): Accuracy 89.2%; BiLSTM/Ensemble: 87.5%. Data splits reported: train 2,151 rows; test 1,668 rows; out-of-sample 365 rows with balanced labels.

Tech Stack: Python, Transformers/PhoBERT, scikit-learn, XGBoost, MongoDB, Docker.

Technical Skills

Programming Languages: Python (Intermediate, 3 yrs), R, Java & JavaScript (Basic)

ML/DL: PyTorch, FAISS, scikit-learn, Transformers, Statsmodels, CVXOPT

LLMs & NLP: A2A & MCP Protocols, LangChain, LangGraph, ReAct Agent, CrewAI, Ollama, NLTK

Data Analysis & Manipulation: Pandas, NumPy, Visualization (Matplotlib/Seaborn)

Big Data: Spark, Databricks, Kafka

Databases: PostgreSQL, MongoDB, MySQL, SQL Server, Milvus, Chroma

Tools: Git, Docker, n8n, Figma, Streamlit, Postman, Trello/Notion, Excel, LaTeX

Accomplishments

Certifications: DeepLearning.AI – CNN Specialization; DataCamp – Data Scientist Professional with Python

Scholarship: HUTECH Talent Scholarship Level 1 (2023–2024)

Competition: HCM-AIC 2024 with team AIO-Chef; 2 Bronze Medals – Vietnam Mathematical Olympiad (2023, 2024)

Languages

IELTS: Overall 6.0

