NGUYEN TRUONG GIA HUY
DATA ENGINEER
Doi Can, Ba Dinh, Hanoi
**********************@*****.***

SUMMARY
Motivated Data Engineer with a solid foundation in building and managing data pipelines for ingestion, transformation, and transfer. Proficient in big data tools such as Apache Spark and Kafka, with hands-on experience in handling large-scale datasets. Skilled in working with databases and eager to contribute technical expertise and a proactive learning mindset to deliver impactful data solutions in a dynamic team environment.
TECHNICAL SKILLS
Data Crawling - Selenium, API calls
Data Transformation - Spark, Python, Java
Data Transfer - Kafka, RabbitMQ
Data Lake - Hadoop, MongoDB
Data Warehouse - MySQL, MongoDB, PostgreSQL
Database Cache - Redis
Data Visualization - Matplotlib, Seaborn, Power BI, Tableau
ML/AI - Transformers and Scikit-learn
DevOps - Docker

PROGRAMMING LANGUAGES
Python, Java and SQL (T-SQL & PL/SQL)

SOFT SKILLS
Languages: Vietnamese (native), English (Professional - C1) and French (B1)
Office Tools: Word, PowerPoint, Excel
PERSONAL PROJECTS

Data Synchronization: MySQL to MongoDB and Redis (April 2025 - May 2025)
Project GitHub: https://github.com/GiaHuy-Miguel/Data-Synchronization-MySQL-to-MongoDB-and-Redis
Description: Synchronize three databases - MySQL (for warehousing), MongoDB (for fast querying) and Redis (for caching) - in real time. Any change to MySQL is detected and streamed/applied directly into the other databases, with validation between steps to ensure accuracy.
Logic 1: Set up the three databases (MySQL, MongoDB and Redis) using Docker
Pull the database images
Set users and passwords
Create schemas for the databases and validate them
Logic 2: Insert data into the three databases simultaneously using Spark
Set up and configure Spark
Write data into MySQL, MongoDB and Redis
Validate the data written by Spark to confirm that nothing is missing and every record is correct
Logic 3: Use Debezium to implement CDC on MySQL
Set up Debezium (including Kafka, Zookeeper and the Debezium MySQL connector)
Debezium writes change reports to Kafka as change events occur, in real time
Logic 4: Use Spark Streaming to read messages from Kafka (see the sketch after this outline)
Change-event messages sent through Kafka are read by Spark
A schema is applied to the Spark read to keep only the before-change and after-change data
Comparing before-change and after-change data identifies the table operation (INSERT, UPDATE, DELETE)
Logic 5: Use Spark Streaming to synchronize the other databases with changes fetched from Kafka
Changes are streamed/applied to MongoDB and Redis in near real time, by table operation type, using Spark Streaming
Each modification to MongoDB and Redis is followed by validation to ensure accuracy
Language: Python
Technical Stack: Spark, Debezium, Kafka, Docker
Libraries: Python Cursor, PySpark, kafka-python, mysql-connector-python, pymongo, redis
Databases: MongoDB, MySQL, Redis
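A minimal, illustrative sketch of Logic 4 and Logic 5 is shown below: Spark Structured Streaming reads Debezium change events from Kafka, keeps only the before/after images and the operation code, and applies each change to MongoDB and Redis. The topic name, table columns and connection settings are assumptions for the example, not the project's actual configuration, and the Kafka source requires the spark-sql-kafka package on the Spark classpath.

# Sketch only: replicate Debezium CDC events from Kafka into MongoDB and Redis.
# Topic name, column names and hosts below are assumed placeholders.
import redis
from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("mysql-cdc-sync").getOrCreate()

# Keep only the before/after images and the operation code from the Debezium payload.
row_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
payload_schema = StructType([
    StructField("before", row_schema),
    StructField("after", row_schema),
    StructField("op", StringType()),      # c = insert, u = update, d = delete
])
envelope = StructType([StructField("payload", payload_schema)])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "dbserver1.demo.customers")   # hypothetical Debezium topic
    .load()
    .select(from_json(col("value").cast("string"), envelope).alias("e"))
    .select("e.payload.before", "e.payload.after", "e.payload.op")
)

def apply_changes(batch_df, _epoch_id):
    """Apply each change event to MongoDB and Redis so they mirror MySQL."""
    mongo = MongoClient("mongodb://localhost:27017")["demo"]["customers"]
    cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
    for row in batch_df.collect():
        if row.op in ("c", "u"):           # insert or update: upsert the after image
            doc = row.after.asDict()
            mongo.replace_one({"_id": doc["id"]}, doc, upsert=True)
            cache.hset(f"customer:{doc['id']}", mapping={k: str(v) for k, v in doc.items()})
        elif row.op == "d":                # delete: remove by the before image's key
            key = row.before.asDict()["id"]
            mongo.delete_one({"_id": key})
            cache.delete(f"customer:{key}")

query = events.writeStream.foreachBatch(apply_changes).start()
query.awaitTermination()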
EXPERIENCE

HONDA VIETNAM COMPANY LTD - Data Engineer Intern (July 2025 - October 2025)
Main Role:
Set up the database, schema and data model in SQL Server (via SSMS) as the domain data mart
Configure the data pipeline transferring data from the data warehouse (Oracle) to the data mart (SQL Server), ensuring data integrity
Write advanced SQL logic in stored procedures
Write table-valued functions and views
Optimize SQL queries to boost performance (from 2 minutes down to 45 seconds for a query on SQL Server)
Configure and perform table partitioning and data cut-off
Set up scheduled jobs on SQL Server with SQL Server Agent
Additional Role:
Handle ad-hoc tasks
Develop Tableau solutions and perform demos
Support BAs in communication with users
Building a Model to Predict Median House Price (December 2023 - January 2024)
Project GitHub: https://github.com/GiaHuy-Miguel/DISTRICT-X-HOUSES-PRICE-PREDICTION
Description: This project uses a house pricing dataset (including median price, total population, and total bedrooms and rooms of the whole area, etc.) to visualize house locations and predict prices. A SHAP explainer is then used to analyze the impact of each variable, showing that house prices are strongly affected by longitude due to climate differences between the North and the South.
Logic 1: Descriptive Analysis
Print the first 20 rows for a brief look at the data
Produce a statistical description of the data
Check for null values and replace them
Check that the datatype of each variable matches its actual values in the dataset
Logic 2: Exploratory Data Analysis
Check the data distribution; if it is not normal, standardize it
Check the correlation among all variables; since the correlations are fairly low, continue with further steps
Check the correlation between distance features and house price using scatter and box plots; mapping the dataset's cities on Google Maps shows mostly coastal locations, so unnecessary features are dropped
Check the correlation between median features and house price using scatter plots
Divide latitude and longitude into clusters with K-Means; since the clusters show no difference in their impact on house price, these variables are not categorized
Logic 3: Data Wrangling
Calculate new attributes: rooms per household, bedrooms-to-rooms ratio and population per household
Recheck their correlations
Drop unnecessary variables
Logic 4: Sample Split (see the sketch after this outline)
Because median income is the most impactful variable, perform a stratified shuffle split based on income
Assign a train sample (80%) and a test sample (20%)
Logic 5: Regression Analysis
Fit Linear Regression, Polynomial Regression, Decision Tree and Random Forest models
Tune the models with Randomized Search
Logic 6: Feature Re-evaluation
Use a SHAP explainer to evaluate which features contribute the most to the house price prediction
Language: Python
Technical Stack: Folium, SHAP, Matplotlib, Seaborn and Scikit-Learn
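The sketch below illustrates Logic 4 and the Random Forest part of Logic 5 with scikit-learn: median income is binned so a stratified shuffle split preserves its distribution across the train and test samples, and the model is tuned with Randomized Search. The file name and column names (median_income, median_house_value) are assumptions for the example, not necessarily the project's exact dataset.

# Sketch only: stratified 80/20 split on binned income, then Randomized Search tuning.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit

housing = pd.read_csv("housing.csv")                  # hypothetical input file

# Bin median income so the split keeps its distribution in both samples.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(housing, housing["income_cat"]))
train_set, test_set = housing.iloc[train_idx], housing.iloc[test_idx]

# Keep numeric predictors only and fill gaps crudely for the sketch.
X_train = (train_set.drop(columns=["median_house_value", "income_cat"])
           .select_dtypes(include=[np.number]).fillna(0))
y_train = train_set["median_house_value"]

# Tune a Random Forest over a small parameter space with Randomized Search.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": [100, 200, 300], "max_depth": [None, 10, 20]},
    n_iter=5, cv=5, scoring="neg_root_mean_squared_error", random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)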
CERTIFICATIONS
Certificate of course completion from FTU and Datapot: "Data Science in Economy and Business", with the highest overall score
IELTS - Overall: 7.5, Reading: 8.5, Listening: 8.5, Writing: 7.0, Speaking: 6.0
DELF B1
EDUCATION
Foreign Trade University, Hanoi
Bachelor of International Business Economics (September 2022 - June 2026, expected)
GPA (current): 3.5
Firm Sentiment Prediction with PhoBERT (December 2024 - January 2025)
Project GitHub: https://github.com/GiaHuy-Miguel/Firm-Sentiment-_-Crawling-AI-Project
Description: This project crawls article titles from CafeF and feeds them into an AI model for sentiment evaluation. The results are then quantified into a proxy index indicating how well a company operated that year.
Logic 1: Read an Excel file containing the stock symbols of the targeted listed companies
Logic 2: Crawl the web with Selenium to collect article titles from CafeF
Get the URL of the CafeF page listing all articles for one listed company
Fetch all information on that page - stock symbol, titles and timestamps - into a Python list
Move to the next page, wait for it to load, and continue the process above
Logic 3: Feed the fetched titles into the model for prediction (see the sketch after this outline)
Tokenize the fetched titles
Load the pre-trained model and predict using Softmax Regression; results are returned as Positive, Negative and Neutral scores summing to 100%, and any score above 50% is taken as the title's sentiment
Logic 4: The sentiment returned by the model is added to the initial Python list
Logic 5: The list with four fields - stock symbol, title, timestamp and sentiment - is written to a CSV file for further processing
Logic 6: Continue with the next company and repeat the process above with a for-loop
Language: Python
Technical Stack: Transformers, Selenium
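A minimal sketch of the prediction step (Logic 3), assuming a PhoBERT checkpoint fine-tuned for three-class sentiment: each title is tokenized, passed through the model, and the softmax score above 50% decides the label. The checkpoint path and label order are placeholders, not the project's actual model.

# Sketch only: PhoBERT-based sentiment prediction for one crawled title.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/phobert-sentiment-checkpoint"    # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
LABELS = ["Negative", "Neutral", "Positive"]           # assumed label order

def predict_sentiment(title: str) -> str:
    """Return the label whose softmax probability exceeds 50%."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    best = int(torch.argmax(probs))
    return LABELS[best] if float(probs[best]) > 0.5 else "Neutral"   # Neutral fallback is an assumption

print(predict_sentiment("Doanh nghiệp báo lãi quý cao kỷ lục"))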