Senior Data Engineer with 8+ years of scalable data solutions

Location:

Los Angeles, CA

Salary:

20 USD/hour

Posted:

June 11, 2026

Resume:

Yorimichi Sato

Senior Data Engineer

*************@*******.*** Warsaw, Poland

SUMMARY

Senior Data Engineer with over 8 years of experience delivering scalable data solutions across analytics, cloud platforms, and AI-driven applications. Proven expertise in designing modern data warehouses, building robust ETL/ELT pipelines, and optimizing large-scale data processing environments on AWS. Experienced in transforming complex business and research requirements into reliable, high-quality data products that support reporting, advanced analytics, and machine learning initiatives. Strong background in data governance, observability, and infrastructure automation, with a focus on performance, reliability, and cost optimization. Adept at collaborating with cross-functional teams to drive data strategy and enable informed business decisions.

SKILLS

Data Engineering

ETL/ELT, Data Modeling, Data Warehousing, Data Lakehouse, Data Pipelines, Data Quality, Data Governance, Data Lineage, Metadata Management, Web Scraping

DevOps & Infrastructure

Terraform, GitHub Actions, CI/CD, Git

Programming

JavaScript, TypeScript, Node.js, Python, SQL, Bash

Big Data

Apache Spark, PySpark, Spark SQL, Spark Streaming, Hadoop

Analytics & BI

Tableau, Excel, Data Analysis, KPI Reporting, Power BI

AI & Machine Learning

OpenAI API, LangChain, Vector Databases, RAG, Semantic Search, Embeddings, Feature Engineering, TensorFlow, Pandas, PyTorch

Cloud & DevOps

AWS, Docker, Kubernetes, Terraform, GitHub Actions, CI/CD Automation, Infrastructure as Code

Databases

PostgreSQL, MySQL, MongoDB, Cassandra, IBM Db2, IBM Cloudant

Orchestration & Transformation

Apache Airflow, Dagster, dbt

Monitoring & Data Quality

Great Expectations, PagerDuty

PROFESSIONAL EXPERIENCE

Senior Data Engineer, SoftServe 12/2023 – Present Austin, US

-Enterprise Data Platform and Data Warehouse Modernization

•Designed and developed scalable ETL/ELT pipelines ingesting data from APIs, relational databases, event streams, external vendors, and web-scraped sources.

•Built automated web data extraction workflows to collect, validate, transform, and load external business data into enterprise analytics platforms.

•Architected dimensional data models, star schemas, and curated data marts supporting analytics and reporting workloads.

•Built automated transformation frameworks for validation, cleansing, enrichment, and loading processes.

•Implemented workflow orchestration using Airflow and Dagster to manage hundreds of production pipelines.

•Established monitoring, alerting, and observability practices to improve platform reliability.

•Supported self-service analytics initiatives by delivering trusted business datasets and reporting layers.

•Optimized warehouse performance through query tuning, indexing, and partitioning strategies.

-Distributed Data Processing and Cloud Analytics Platform

•Built distributed processing pipelines using PySpark and AWS EMR.

•Developed reusable frameworks for large-scale batch and near-real-time processing.

•Optimized Spark workloads through partitioning, execution plan tuning, and resource management.

•Designed secure cloud architectures leveraging AWS services including ECS, Lambda, EC2, IAM, and Step Functions.

•Automated infrastructure deployment and operational monitoring using AWS-native services.

•Improved scalability and performance of enterprise analytics workloads.

•Collaborated with data scientists and analysts to provide production-ready datasets for advanced analytical use cases.

-Data Governance, Quality, and Observability Framework

•Designed schema validation and reconciliation frameworks.

•Implemented lineage tracking and metadata management solutions.

•Built anomaly detection and monitoring processes for critical datasets.

•Developed automated quality checks across analytical and operational systems.

•Improved data consistency and governance standards across multiple business domains.

•Established operational dashboards and alerting mechanisms for proactive issue detection.

•Partnered with stakeholders to define data quality SLAs and governance policies.

-AI and Machine Learning Data Platform

•Designed and maintained feature engineering pipelines used for predictive analytics and machine learning model training.

•Developed web scraping and data collection pipelines using Python to acquire structured and unstructured content from public websites, knowledge bases, and external sources for AI and analytics applications.

•Built automated extraction workflows handling HTML parsing, pagination, content normalization, metadata enrichment, and data quality validation.

•Integrated scraped content into RAG and vector search pipelines through document preprocessing, chunking, embedding generation, and indexing workflows.

•Built Retrieval-Augmented Generation (RAG) data pipelines integrating enterprise knowledge sources into vector search systems.

•Implemented document chunking, embedding generation, metadata enrichment, and indexing workflows for large-scale knowledge repositories.

•Supported vector database solutions to enable semantic search and LLM-powered question-answering applications.

•Collaborated with machine learning engineers and data scientists to provide high-quality training and inference datasets.

•Automated data validation and monitoring processes for AI and ML pipelines to ensure reliability and model performance.

Data Engineer, Modvise 09/2021 – 10/2023 Poland

-Data Platform Infrastructure Automation and Research Data Warehouse

•Developed Infrastructure-as-Code solutions using Terraform to automate AWS resource provisioning.

•Built CI/CD pipelines with GitHub Actions for deployment and environment management.

•Reduced environment setup and deployment time from two days to less than twenty minutes.

•Automated provisioning and configuration of 12 AWS services supporting data workloads.

•Designed normalized PostgreSQL schemas for five large-scale research datasets containing more than 50 million records.

•Created indexing and query optimization strategies that improved average query performance by 80%.

•Performed data modeling, profiling, validation, and migration activities across multiple source systems.

•Supported analytics teams by delivering optimized warehouse structures and curated datasets.

-Genomics Research Lakehouse and Analytics Platform

•Architected a scalable lakehouse solution using AWS S3, Glue, and Athena.

•Built ingestion pipelines to process and catalog over 8TB of genomic sequencing data.

•Developed metadata management and partitioning strategies to improve query performance.

•Enabled researchers to perform ad hoc SQL analytics on previously inaccessible datasets.

•Eliminated dependency on custom scripts for data exploration and analysis.

•Designed data organization standards and lifecycle management policies for large research datasets.

•Collaborated with research scientists to translate analytical requirements into scalable data solutions.

•Optimized storage and query costs while maintaining accessibility and performance.

-Customer Data Platform Modernization and Real-Time Analytics

•Designed and developed more than 60 dbt models to standardize customer data across transactional and operational systems.

•Implemented Slowly Changing Dimension (SCD Type 2) methodologies to preserve historical customer attribute changes and improve analytical accuracy.

•Built scalable ELT workflows that reduced data refresh latency from 24 hours to 15 minutes.

•Collaborated with business analysts and stakeholders to define data marts and reporting requirements.

•Optimized transformation logic and SQL models to support over 200 active business users.

•Created reusable testing and documentation standards within dbt to improve maintainability and governance.

•Supported self-service analytics initiatives by providing curated datasets and semantic business definitions.

-Enterprise Big Data Processing and Cost Optimization Platform

•Developed and maintained distributed ETL pipelines using PySpark on AWS EMR.

•Processed terabytes of structured and semi-structured data from multiple enterprise sources.

•Tuned Spark execution plans through partition optimization, caching strategies, and broadcast joins.

•Reduced daily batch processing runtime by 65%, decreasing execution time from 8 hours to 2.8 hours.

•Lowered monthly infrastructure costs by approximately $15,000 through resource optimization and workload tuning.

•Integrated Hadoop ecosystem technologies with existing enterprise ETL workflows.

•Implemented monitoring and logging solutions to improve operational visibility of production jobs.

•Worked closely with data consumers to improve data availability and delivery SLAs.

-Enterprise Data Quality and Observability Framework

•Designed and implemented a comprehensive data quality framework using Great Expectations.

•Developed over 300 automated validation checks covering completeness, uniqueness, consistency, and business rules.

•Monitored data quality across more than 80 critical production tables.

•Integrated automated alerting and incident management workflows through PagerDuty.

•Implemented schema drift detection to proactively identify upstream source changes.

•Improved production pipeline reliability to 99.5% uptime.

•Built operational dashboards and reporting tools to track quality metrics and SLA compliance.

•Collaborated with engineering and analytics teams to establish data governance standards.

Data Analyst, Latori 07/2018 – 06/2021 Germany

-Enterprise Business Intelligence and Executive Reporting

•Collected, cleaned, and analyzed structured and unstructured data from multiple business sources.

•Developed executive dashboards and reporting solutions using Tableau and Excel.

•Defined and tracked key performance indicators across sales, operations, and customer-focused initiatives.

•Automated recurring reporting processes to improve reporting efficiency and accuracy.

•Delivered actionable insights that contributed to approximately $1M in annual cost savings.

•Collaborated with business stakeholders to translate reporting requirements into analytical solutions.

-Sales Analytics and Revenue Optimization Initiative

•Built statistical models on large datasets to evaluate customer purchasing behavior and product performance.

•Performed exploratory data analysis to identify trends, patterns, and revenue opportunities.

•Developed recommendations that increased online sales by up to 10% at the product level.

•Partnered with data scientists to prepare and analyze datasets for predictive modeling initiatives.

•Contributed analytical insights that helped improve sales performance by 20%.

•Presented findings and recommendations to business stakeholders and leadership teams.

-Analyzed stockroom and operational performance data to identify inefficiencies and optimization opportunities.

•Evaluated inventory movement, storage utilization, and operational workflows.

•Delivered recommendations that reduced operating costs by 15%.

•Worked closely with operations teams to understand customer demand patterns and business requirements.

•Developed analytical reports supporting inventory planning and operational decision-making.

•Monitored operational KPIs and measured improvement initiatives over time.

EDUCATION

Bachelor of Engineering, National University of Singapore 08/2011 – 06/2015
