
Data Engineer - Spark, Kafka, Data Pipelines, PL/SQL

Location:
Hanoi, Vietnam
Posted:
December 20, 2025

Contact this candidate

Resume:

NGUYEN TRUONG GIA HUY

DATA ENGINEER - Doi Can, Ba Dinh, Hanoi

+848********

**********************@*****.***

SUMMARY

Motivated Data Engineer with a solid foundation in building and managing data pipelines for ingestion, transformation, and transfer. Proficient in big data tools such as Apache Spark and Kafka, with hands-on experience in handling large-scale datasets. Skilled in working with databases and eager to contribute technical expertise and a proactive learning mindset to deliver impactful data solutions in a dynamic team environment.

TECHNICAL SKILLS

Data Crawling - Selenium, API calls
Data Transformation - Spark, Python, Java
Data Transfer - Kafka, RabbitMQ
Data Lake - Hadoop, MongoDB
Data Warehouse - MySQL, MongoDB, PostgreSQL
Database Cache - Redis
Data Visualization - Matplotlib, Seaborn, Power BI, Tableau
ML/AI - Transformers and Scikit-learn
DevOps - Docker

PROGRAMMING LANGUAGES

Python, Java and SQL (T-SQL & PL/SQL)

SOFT SKILLS

Languages: Vietnamese (native), English (Professional - C1) and French (B1)
Office Tools: Word, PowerPoint, Excel

PERSONAL PROJECT

Data Synchronization: MySQL to MongoDB and Redis (April 2025 - May 2025)

Logic 1: Set up 3 databases (MySQL, MongoDB and Redis) using Docker
Pull the database images
Set users and passwords
Create schemas for the databases and validate them
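A minimal sketch of the schema creation and validation step in Logic 1, assuming the three Docker containers are already running on localhost with hypothetical credentials, database names and a sample "customers" table; it uses the mysql-connector-python, pymongo and redis libraries listed under this project.

import mysql.connector
import redis
from pymongo import MongoClient

# Hypothetical local connection settings for the three Dockerized databases.
MYSQL_CFG = dict(host="localhost", port=3306, user="sync_user",
                 password="sync_pass", database="warehouse")
MONGO_URI = "mongodb://localhost:27017/"
REDIS_CFG = dict(host="localhost", port=6379, db=0)

# MySQL: create a sample table and validate that it exists.
conn = mysql.connector.connect(**MYSQL_CFG)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id   INT PRIMARY KEY,
        name VARCHAR(100),
        city VARCHAR(100)
    )
""")
cur.execute("SHOW TABLES LIKE 'customers'")
assert cur.fetchone() is not None, "MySQL schema validation failed"
conn.commit()
conn.close()

# MongoDB is schemaless, so validation here is a connectivity and collection check.
mongo = MongoClient(MONGO_URI)
db = mongo["warehouse"]
db["customers"].insert_one({"_id": 0, "name": "probe", "city": "Hanoi"})
assert "customers" in db.list_collection_names(), "MongoDB validation failed"
db["customers"].delete_one({"_id": 0})

# Redis: a ping plus a round-trip write/read.
r = redis.Redis(**REDIS_CFG)
assert r.ping(), "Redis is not reachable"
r.set("customers:probe", "ok")
assert r.get("customers:probe") == b"ok"
r.delete("customers:probe")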

Logic 2: Insert data into the 3 databases simultaneously using Spark
Set up and configure Spark
Write data into MySQL, MongoDB and Redis
Validate the data written by Spark to confirm there is no missing data and everything imported is correct
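A minimal sketch of the Spark write step in Logic 2, assuming a local Spark session with the MySQL JDBC driver and the MongoDB Spark connector available on the classpath; URLs, credentials, table and collection names are hypothetical. Redis is written through redis-py inside foreachPartition, a common pattern when no dedicated Spark-Redis connector is used.

from pyspark.sql import SparkSession

# Assumes the MySQL JDBC driver and the MongoDB Spark connector jars are available to Spark.
spark = SparkSession.builder.appName("triple-write").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "Hanoi"), (2, "Binh", "Da Nang")],
    ["id", "name", "city"],
)

# 1) MySQL via JDBC (hypothetical URL and credentials).
(df.write.format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/warehouse")
   .option("dbtable", "customers")
   .option("user", "sync_user")
   .option("password", "sync_pass")
   .mode("append")
   .save())

# 2) MongoDB via the Spark connector (option names follow connector 10.x; older versions differ).
(df.write.format("mongodb")
   .option("connection.uri", "mongodb://localhost:27017")
   .option("database", "warehouse")
   .option("collection", "customers")
   .mode("append")
   .save())

# 3) Redis via redis-py inside each partition, keyed by id.
def write_partition_to_redis(rows):
    import redis
    r = redis.Redis(host="localhost", port=6379, db=0)
    for row in rows:
        r.hset(f"customers:{row['id']}", mapping={"name": row["name"], "city": row["city"]})

df.foreachPartition(write_partition_to_redis)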

Logic 3: Use Debezium to implement CDC on MySQL
Set up Debezium (including Kafka, Zookeeper and the Debezium MySQL connector)
Debezium writes change reports to Kafka after change events, in real time

Logic 4: Use Spark Streaming to read messages from Kafka
Change-event messages sent through Kafka are read by Spark
A schema is applied to the Spark read so that only the before-change and after-change data are kept
Comparing the before-change and after-change data indicates the table operation (INSERT, UPDATE, DELETE)

Logic 5: Use Spark Streaming to synchronize the other databases with the changes fetched from Kafka
Changes are streamed/applied to MongoDB and Redis in near real time by operation type using Spark Streaming
Each modification to MongoDB and Redis is followed by validation to ensure accuracy

Language: Python
Technical Stacks: Spark, Debezium, Kafka, Docker
Libraries: Python Cursor, PySpark, kafka-python, mysql-connector-python, pymongo, redis
Databases: MongoDB, MySQL, Redis
Project Github: https://github.com/GiaHuy-Miguel/Data-Synchronization-MySQL-to-MongoDB-and-Redis
Description: The project synchronizes three different databases, MySQL (for warehousing), MongoDB (for fast querying) and Redis (for caching), in real time. Any change to MySQL is detected and streamed/applied directly into the other databases, with validation between steps to ensure accuracy.
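A minimal sketch of Logics 3-5, assuming Debezium already publishes MySQL change events to a Kafka topic named warehouse.warehouse.customers (the topic, connection URIs and column names are hypothetical, and the spark-sql-kafka package is assumed to be on the classpath); it reads the Debezium JSON envelope with Spark Structured Streaming, keeps only the before/after row images and the operation code, and applies each micro-batch to MongoDB and Redis.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-sync").getOrCreate()

# Schema of the row image inside the Debezium envelope (hypothetical columns).
row_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("city", StringType()),
])

# Debezium JSON envelope: only before, after and the operation code (c/u/d/r) are kept.
envelope = StructType([
    StructField("payload", StructType([
        StructField("before", row_schema),
        StructField("after", row_schema),
        StructField("op", StringType()),
    ])),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "warehouse.warehouse.customers")
       .option("startingOffsets", "latest")
       .load())

changes = (raw.select(from_json(col("value").cast("string"), envelope).alias("e"))
              .select("e.payload.before", "e.payload.after", "e.payload.op"))

def apply_batch(batch_df, _batch_id):
    """Apply one micro-batch of change events to MongoDB and Redis."""
    import redis
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://localhost:27017")["warehouse"]["customers"]
    r = redis.Redis(host="localhost", port=6379, db=0)

    # Small batches assumed; use foreachPartition for large ones.
    for row in batch_df.collect():
        if row["op"] in ("c", "u", "r") and row["after"] is not None:
            doc = row["after"].asDict()
            mongo.replace_one({"_id": doc["id"]}, doc, upsert=True)
            r.hset(f"customers:{doc['id']}",
                   mapping={"name": doc["name"], "city": doc["city"]})
        elif row["op"] == "d" and row["before"] is not None:
            key = row["before"]["id"]
            mongo.delete_one({"_id": key})
            r.delete(f"customers:{key}")

query = changes.writeStream.foreachBatch(apply_batch).start()
query.awaitTermination()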

EXPERIENCE

HONDA VIETNAM COMPANY LTD
Position: Data Engineer Intern (July 2025 - October 2025)

Main Role:
Set up the database, schema and data model on SSMS as the domain data mart
Configure the data pipeline that transfers data from the data warehouse (Oracle) to the data mart (SQL Server) while ensuring data integrity
Write advanced SQL logic in stored procedures
Write table-valued functions and views
Optimize SQL queries to boost performance (from 2 minutes down to 45 seconds for a query on SQL Server)
Configure and perform table partitioning and data cut-off
Set up scheduled jobs on SQL Server with SQL Server Agent

Additional Role:
Handle ad-hoc tasks
Develop Tableau solutions and perform demos
Support the BA in communication with users

PERSONAL PROJECT: Building Model to Predict Median House Price (December 2023 - January 2024)

Logic 1: Descriptive Analysis
Print the first 20 rows to take a brief look at the data
Perform a statistical description of the data
Check for existing null values and replace them
Check whether the datatype of each variable corresponds to its actual values in the dataset
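A minimal sketch of the descriptive-analysis step in Logic 1, assuming the dataset is a local CSV named housing.csv; the file name and the use of pandas are assumptions (the stack listed below names Scikit-Learn and Seaborn, which typically sit alongside pandas).

import pandas as pd

# Hypothetical local copy of the house pricing dataset.
df = pd.read_csv("housing.csv")

# First 20 rows and overall statistics.
print(df.head(20))
print(df.describe())

# Locate null values and replace them, here with each numeric column's median.
print(df.isnull().sum())
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Confirm that each column's dtype matches what its values look like.
print(df.dtypes)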

Logic 2: Exploratory Data Analysis
Check the data distribution; if it is not normally distributed, perform standardization
Check the correlation among all variables; since their correlations are fairly low, further steps follow
Check the correlation between the distance features and house prices using scatter plots and box plots; mapping the dataset's cities on Google Maps shows that it mostly consists of coastal locations, so the unnecessary features are dropped
Check the correlation between the median features and house prices using scatter plots
Divide latitude and longitude into clusters using K-Means; comparing those clusters' impact on house prices shows no difference, so these variables are not categorized
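A minimal sketch of the K-Means step in Logic 2, assuming columns named latitude, longitude and median_house_value (hypothetical names); it clusters the coordinates and compares the average price per cluster, which is how the "no difference" conclusion above could be checked.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("housing.csv")  # hypothetical local copy of the dataset

# Standardize the coordinates so neither dominates the distance metric.
coords = StandardScaler().fit_transform(df[["latitude", "longitude"]])

# Cluster the locations; the number of clusters is an assumption.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["geo_cluster"] = kmeans.fit_predict(coords)

# Similar per-cluster averages suggest the clusters add little information.
print(df.groupby("geo_cluster")["median_house_value"].mean())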

Logic 3: Data Wrangling
Calculate new attributes: rooms per household, bedrooms-to-rooms ratio and population per household

Recheck their correlation

Drop unnecessary variables
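A minimal sketch of the derived attributes in Logic 3, assuming column names such as total_rooms, total_bedrooms, population, households and median_house_value (hypothetical, modeled on a typical housing schema).

import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical local copy of the dataset

# Derived attributes described in Logic 3.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_to_rooms_ratio"] = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]

# Recheck how the new attributes correlate with the target before dropping columns.
print(df.corr(numeric_only=True)["median_house_value"].sort_values(ascending=False))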

Logic 4: Sample Split

Since median income is the most impactful variable, perform a stratified shuffle split based on income

Assign train sample (80%) and test sample (20%)
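A minimal sketch of the stratified 80/20 split in Logic 4, assuming a median_income column (hypothetical name) that is first binned into income categories, which is the usual way to stratify on a continuous variable.

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.read_csv("housing.csv")  # hypothetical local copy of the dataset

# Bin the continuous income into a few categories so stratification is possible.
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["income_cat"]))
train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# The income distribution should now be nearly identical in both samples.
print(train_set["income_cat"].value_counts(normalize=True))
print(test_set["income_cat"].value_counts(normalize=True))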

Logic 5: Regression Analysis
Linear Regression, Polynomial Regression, Decision Tree and Random Forest are all used
Tune the models with Randomized Search
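A minimal sketch of tuning the Random Forest in Logic 5 with Randomized Search; the feature selection, target name, parameter ranges and the plain train/test split (used here instead of the stratified split for brevity) are all assumptions.

import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

df = pd.read_csv("housing.csv")  # hypothetical local copy of the dataset
features = df.drop(columns=["median_house_value"]).select_dtypes(include="number").fillna(0)
target = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Hypothetical search space; the project also tried Linear, Polynomial and Decision Tree models.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)  # best parameters and their RMSE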

Logic 6: Feature Re-evaluation
A SHAP explainer is used to evaluate which features contribute the most to the house price prediction

Language: Python
Technical Stack: Folium, SHAP, Matplotlib, Seaborn and Scikit-Learn
Project Github: https://github.com/GiaHuy-Miguel/DISTRICT-X-HOUSES-PRICE-PREDICTION
Description: This project uses a house pricing dataset (including the median price and the area's total population, total bedrooms, total rooms, etc.) to visualize house locations and predict prices. A SHAP explainer is used to further analyze the impact of each variable, yielding the insight that house prices are strongly affected by longitude due to changes in climate across the North and the South.
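A minimal sketch of Logic 6, fitting a small Random Forest and using a SHAP TreeExplainer to rank feature contributions; the column names and model choice are assumptions.

import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("housing.csv")  # hypothetical local copy of the dataset
X = df.drop(columns=["median_house_value"]).select_dtypes(include="number").fillna(0)
y = df["median_house_value"]

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# TreeExplainer is the usual choice for tree ensembles; the summary plot
# ranks features by their mean absolute SHAP value.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)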

CERTIFICATIONS

Certificate of course completion from FTU and Datapot: "Data Science in Economy and Business", with the highest overall score
IELTS - Overall: 7.5 (Reading: 8.5, Listening: 8.5, Writing: 7.0, Speaking: 6.0)
DELF B1

EDUCATION

Foreign Trade University, Hanoi
Bachelor of International Business Economics
September 2022 - June 2026 (expected)
GPA (current): 3.5

PERSONAL PROJECT: Firm Sentiment Prediction with PhoBERT (December 2024 - January 2025)

Project Github: https://github.com/GiaHuy-Miguel/Firm-Sentiment-_-Crawling-AI-Project
Description: This project crawls articles from CafeF.com and feeds them into an AI model for sentiment evaluation. The results are then quantified into a proxy index indicating how well the company operated that year.

Logic 1: Read an Excel file containing the stock symbols of the targeted listed companies
Logic 2: Crawl the web with Selenium to collect article titles from CafeF
Get the URL of the CafeF web page that lists all articles for one listed company
Fetch all information on that page (stock symbol, titles and timestamps) into a Python list
Move to the next page, wait for it to load and continue the process above
Logic 3: Feed the fetched titles into the model for prediction
Tokenize the fetched titles
Load them into the pre-trained model and predict using softmax regression; results are returned as Positive, Negative and Neutral percentages summing to 100%, and any class above 50% is taken as the title's sentiment
Logic 4: The sentiment returned by the model is added to the initial Python list
Logic 5: The list with 4 fields (stock symbol, title, timestamp and sentiment) is written to a CSV file for further processing
Logic 6: Continue with the next company and repeat the process above in a for-loop

Language: Python
Technical Stacks: Transformers, Selenium
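A minimal sketch of the tokenize-and-predict step in Logic 3, assuming a PhoBERT-based sequence-classification checkpoint fine-tuned for 3-class sentiment; the model name your-org/phobert-sentiment, the label order and the fallback when no class passes 50% are placeholders, and PhoBERT normally expects word-segmented Vietnamese input, which is omitted here for brevity.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; the real project would load its own fine-tuned PhoBERT model.
MODEL_NAME = "your-org/phobert-sentiment"
LABELS = ["Positive", "Negative", "Neutral"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def predict_sentiment(title: str) -> str:
    # Tokenize one crawled article title (word segmentation omitted here).
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    best = int(torch.argmax(probs))
    # Keep the class only if it clears the 50% threshold described in Logic 3;
    # the "Undecided" fallback is an assumption, not part of the original project.
    return LABELS[best] if probs[best].item() > 0.5 else "Undecided"

print(predict_sentiment("Doanh nghiệp công bố lợi nhuận quý tăng mạnh"))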


