Data Engineering Intern - ETL/Streaming Pipelines

Location:

Thien An, Dak Lak, Vietnam

Posted:

June 24, 2026


Resume:

PHAN ANH TÀI

Data Engineering Intern

Ho Chi Minh City, Vietnam · 032*-***-*** · ***************@*****.*** · github.com/tai20605

CAREER OBJECTIVE

Final-year Data Science student at UEH with hands-on experience building end-to-end data pipelines using modern data stack technologies. Seeking a Data Engineering internship to contribute to ETL/ELT pipeline development, data warehouse design, and real-time stream processing in a professional team environment.

EDUCATION

University of Economics Ho Chi Minh City (UEH) · 2023 – 2027

Bachelor of Science Candidate, Data Science · GPA: 3.4 / 4.0

TECHNICAL SKILLS

Languages: Python, SQL

Data Engineering: Apache Airflow, Apache Kafka, Apache Spark, ETL/ELT, Data Modeling

Big Data & Lakehouse: BigQuery, PostgreSQL, SQL Server, Apache Iceberg, dbt, Trino, MinIO, GCS, HDFS

Infra & Cloud: Docker, Terraform, Git/GitHub, Google Cloud Platform (GCP)

PROJECTS

Olist E-Commerce Data Platform · GitHub

•Designed and implemented a production-grade data lakehouse that processes real-time and batch transactions for the Olist e-commerce business, using a decoupled, scalable architecture.

•Designed a Kafka-based ingestion architecture processing 1,000+ events/sec from JSON sources, utilizing Spark Structured Streaming to write raw data directly to Apache Iceberg tables on MinIO with an ingestion latency of under 15 seconds.

•Orchestrated the automated end-to-end ELT workflow with Apache Airflow: dbt transformations running on the Trino query engine support schema evolution and structure raw logs into a Gold-layer snowflake schema, which a PySpark batch JDBC export then loads into a PostgreSQL warehouse.

•Guaranteed zero data loss and 99.9% pipeline uptime by enforcing strict data quality gates via automated dbt testing (executing 39 tests), performing automated volume anomaly detection against a 7-day average, and feeding Grafana dashboards for real-time health monitoring.

Boston Rideshare Platform · GitHub

•Built a streaming platform that simulates real-time Uber and Lyft rideshare events in Boston and processes 2,000+ events per second under stress testing, addressing the need for live data validation, structural modeling, and pipeline latency monitoring.

•Ingested live ride event streams via Apache Kafka and deployed a PySpark Streaming job to write raw data as incremental Parquet files to Google Cloud Storage (GCS) for cost-effective storage.

•Automated the entire workflow with Apache Airflow, using custom dependency checks to verify stream flow (keeping ingestion lag under 2 minutes), refresh BigQuery external tables, and trigger dbt Core to build an optimized star schema that reduced query runtimes by 35%. Managed all cloud resources as infrastructure-as-code with Terraform.

•Established a data quality observability framework: custom dbt tests compute a composite quality score (99.5%+ compliance rate), and a Grafana dashboard visualizes SLA compliance metrics.

CERTIFICATIONS & LANGUAGE

•IBM Data Science Professional Certificate — IBM / Coursera (2026)

•Python for Everybody Specialization — University of Michigan / Coursera (2024)

•Mastering Big Data Analytics — Great Learning (2026)

•TOEIC 725 — ETS (2026)

SOFT SKILLS

•Problem-Solving & Critical Thinking: Systematically debugged complex pipeline failures and designed data architecture from scratch across multiple end-to-end projects.

•Self-Learning: Independently explored and adopted data stack tools (Kafka, Spark, dbt, Airflow) through official documentation, open-source projects, and hands-on experimentation.

•Teamwork & Collaboration: Contributed as a volunteer at Summit AI 2026, collaborating effectively with cross-functional organizing committees to support operations at a large-scale AI community event.

•Attention to Detail: Enforced data integrity by configuring comprehensive dbt tests (uniqueness, nulls, relationships) to catch anomalies across data layers.
