
Data warehouse, Dbt, Airflow, Spark

Location:
Hanoi, Vietnam
Posted:
February 18, 2025


Resume:

Cao The Binh Phuong

Phung Khoang - Trung Van - Nam Tu Liem - Ha Noi

088******* | ***********@*****.*** | andrewcao.site

OBJECTIVE

I have nearly one year of experience as a Data Engineer, with a strong understanding of key concepts and technologies in data engineering. I have built ELT systems and a data warehouse using open-source platforms such as dbt, Airflow, Trino, and Spark. I aim to develop expertise in designing and orchestrating real-time data pipelines and big data systems using open-source platforms, as well as cloud services like AWS and GCP.

WORK EXPERIENCE

Asim Group - Web Developer | November 2022 - June 2023
Main Tech Stacks: ReactJS, ExpressJS

Role Overview: Worked as a web developer responsible for building and implementing features of a media website, including front-end development, interactive components, and API integration.

Task Details: Converted Figma designs to HTML/CSS, developed React.js features such as SSO login and a video player, and supported API development.

Inda - Data Engineer Intern | March 2024 - June 2024
Main Tech Stacks: Spark, Docker, Trino, Iceberg

Role Overview: Worked as a Data Engineering Intern, responsible for setting up essential tools like Spark and Trino using Docker and building an ELT pipeline with Spark.

Task Details: First, my team and I researched and wrote the Docker Compose configuration. We then retrieved data from e-commerce platform APIs, flattened it into a tabular format, and transformed and loaded it into fact and dimension tables.
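
For illustration only, a minimal PySpark sketch of the kind of flatten-and-load step described above; the API payload schema, staging path, and Iceberg table names are assumptions, not details from the actual project.

# Illustrative only: flatten a nested e-commerce API payload and load it into
# an Iceberg table with Spark. Schema, paths, and table names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ecommerce_elt_sketch").getOrCreate()

# Assume the raw API responses were dumped as JSON files into a staging path.
raw = spark.read.json("s3a://staging/ecommerce/orders/")

# Flatten the nested structure: one row per order line item.
orders_flat = (
    raw
    .withColumn("item", F.explode("items"))
    .select(
        F.col("order_id"),
        F.to_date("created_at").alias("order_date"),
        F.col("customer.id").alias("customer_id"),
        F.col("item.sku").alias("sku"),
        F.col("item.quantity").alias("quantity"),
        F.col("item.price").alias("unit_price"),
    )
)

# Append into an existing Iceberg fact table (catalog and table are placeholders).
orders_flat.writeTo("lakehouse.dw.fact_order_item").append()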

Hebela - Data Engineer (Fresher) | July 2024 - January 2025
Main Tech Stacks: Trino, dbt, Docker, Iceberg

Role Overview: Worked as a fresher Data Engineer, responsible for designing and building parts of a data warehouse and ELT pipelines and customizing several data tools.

Task Details: Designed dimension and fact tables following the Star Schema model, performed ELT from the ERP system into the data warehouse, set up an action-tracking pipeline, customized several features of Superset, and developed a tool to sync data.
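
For illustration, a small sketch of how a Star Schema dimension with a surrogate key can be derived; the production work used Trino and dbt, so this PySpark fragment only shows the shape of the transformation, and all table and column names are assumed.

# Illustrative only: derive a customer dimension for a Star Schema from an ERP
# staging table, adding a hash-based surrogate key. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("star_schema_sketch").getOrCreate()

stg_customer = spark.table("staging.erp_customer")

dim_customer = (
    stg_customer
    .select("customer_code", "customer_name", "segment", "city")
    .dropDuplicates(["customer_code"])
    .withColumn("customer_sk", F.sha2(F.col("customer_code").cast("string"), 256))
    .withColumn("loaded_at", F.current_timestamp())
)

dim_customer.writeTo("dw.dim_customer").createOrReplace()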

EDUCATION

THANG LONG UNIVERSITY - Information Technology | 2020 - 2024
GPA: 3.3/4

During my time at Thang Long University, I gained a solid foundation in Information Technology. My studies covered a wide range of topics, including programming languages, web development, database management, and networking. In addition, my coursework helped me develop problem-solving skills, critical thinking, and the ability to adapt to new technologies, all of which are essential in the IT field.

SKILLS

Programming languages: SQL, Python, a little Java
Technology: PostgreSQL, Docker, Kafka, MinIO-Iceberg, Spark, dbt, Airflow
Language: English (Conversational)

CERTIFICATIONS

700 points - Internal TOEIC certificate (2023)
Problem Solving - HackerRank (2024)
SQL - HackerRank (2024)

PERSONAL PROJECTS

Data warehouse for Adventure Works DB - Data Engineer | 11/2024 - 12/2024

Context: I wanted to build a stable data warehouse and an effective ELT pipeline with popular open-source tools.

Main tech stacks: Spark, dbt, Airflow, MinIO-Iceberg, ClickHouse.

Description: I use Spark to extract data from two databases and load it into a MinIO-Iceberg staging area. Then I transform the data into fact and dimension tables using batch processing. I leverage dbt to create analysis-ready cubes and store them in ClickHouse for improved query performance.

Github: https://github.com/andrewcaoo/AV_WH
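
For illustration, a minimal sketch of the extract-to-staging step; the JDBC connection details, database, and catalog names are assumptions rather than values from the repository.

# Illustrative only: pull a source table over JDBC with Spark and land it in a
# MinIO-backed Iceberg staging table. Connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adventureworks_extract_sketch").getOrCreate()

sales = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/adventureworks")
    .option("dbtable", "sales.salesorderheader")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the raw extract in the Iceberg staging area (catalog name is assumed).
sales.writeTo("staging.raw_salesorderheader").createOrReplace()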

Processing US pharmacy data by Spark - Data Engineer | 03/2024 - 03/2024

Context: This is a practice project where I applied my knowledge of PySpark syntax and of setting up Spark and Hadoop on a local machine.

Main tech stacks: Hadoop, Spark.

Description: I use Spark to implement data cleaning and transformation steps, such as standardizing names, handling null values, and calculating measures. Finally, I load the processed data into a data warehouse built on PostgreSQL.

Github: https://github.com/andrewcaoo/Processing_data_by_local_hadoop_and_spark
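
For illustration, a minimal example of the kind of cleaning and loading logic described; column names, rules, and connection details are assumed.

# Illustrative only: standardize names, handle nulls, compute a measure, and
# load the result into PostgreSQL. Column and table names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pharmacy_cleaning_sketch").getOrCreate()

raw = spark.read.option("header", True).csv("data/us_pharmacy.csv")

clean = (
    raw
    .withColumn("pharmacy_name", F.initcap(F.trim(F.col("pharmacy_name"))))
    .withColumn("claims", F.coalesce(F.col("claims").cast("int"), F.lit(0)))
    .withColumn("revenue", F.coalesce(F.col("revenue").cast("double"), F.lit(0.0)))
    .withColumn(
        "revenue_per_claim",
        F.when(F.col("claims") > 0, F.col("revenue") / F.col("claims")).otherwise(None),
    )
)

# Load into the PostgreSQL warehouse (connection details are placeholders).
(
    clean.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/warehouse")
    .option("dbtable", "public.pharmacy_clean")
    .option("user", "dwh_user")
    .option("password", "********")
    .mode("overwrite")
    .save()
)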

Data pipeline for customer data - Data Engineer | 07/2024 - 07/2024

Context: I built a complete data pipeline, from data extraction and transformation to loading into a data warehouse and visualizing it in a BI tool.

Main tech stacks: Mage, Trino, MinIO-Iceberg.

Description: I collect customer data daily for reporting by extracting it via API with Python, storing it in a MinIO data lake, and performing incremental loading using SCD Type 2. Trino is then used to query the lakehouse, and Power BI connects to the base warehouse for visualization.

Github: https://github.com/andrewcaoo/Customer_data_pipeline
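
For illustration, a simplified sketch of SCD Type 2 handling; the real pipeline runs in Mage, so this PySpark fragment only outlines the close-out-and-append pattern, with key and column names assumed.

# Illustrative only: a simplified SCD Type 2 step for a customer dimension.
# Changed rows have their current version closed out and a new version opened.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

dim = spark.table("lake.dim_customer")           # existing dimension (placeholder)
incoming = spark.table("lake.stg_customer_day")  # today's extract (placeholder)

current = dim.filter(F.col("is_current") == True)

# Rows whose tracked attributes changed compared to the current version.
changed = (
    incoming.alias("n")
    .join(current.alias("c"), "customer_id")
    .filter(
        (F.col("n.email") != F.col("c.email"))
        | (F.col("n.address") != F.col("c.address"))
    )
)

# 1) Close out the old versions of changed rows.
closed = (
    current.join(changed.select("customer_id"), "customer_id", "left_semi")
    .withColumn("is_current", F.lit(False))
    .withColumn("valid_to", F.current_date())
)

# 2) Open new versions for changed rows (new customers omitted for brevity).
opened = (
    changed.select("customer_id", "n.email", "n.address")
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_date())
    .withColumn("valid_to", F.lit(None).cast("date"))
)

# In the real pipeline these rows would be merged back into the lake table;
# here they are simply unioned for illustration.
result = closed.unionByName(opened, allowMissingColumns=True)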

Near real time data pipeline with Spark Streaming - Data Engineer | 01/2025 - 01/2025

Context: I built this project to practice Kafka and Spark Streaming.

Main tech stacks: Kafka, Spark, Cassandra, Airflow.

Description: I used Airflow to orchestrate data retrieval from an API every 10 seconds and push it to a Kafka broker. Downstream, I used Spark to consume, process, and store the data in a Cassandra database.

Github: https://github.com/andrewcaoo/Sample_streaming_pipeline

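For illustration, a minimal Structured Streaming sketch of the Kafka-to-Cassandra leg; the topic, schema, keyspace, and checkpoint location are assumptions, and the Kafka source and Cassandra connector packages are assumed to be on the Spark classpath.

# Illustrative only: consume JSON events from Kafka with Spark Structured
# Streaming and write each micro-batch to Cassandra. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka_to_cassandra_sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def write_batch(batch_df, batch_id):
    # Append each micro-batch to a Cassandra table (keyspace/table assumed).
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="events", table="user_events")
        .mode("append")
        .save()
    )

query = (
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/user_events")
    .foreachBatch(write_batch)
    .start()
)
query.awaitTermination()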



