We're looking for a strong DevOps engineer who can help scale and operationalize our infrastructure as the platform grows. This is not a pure platform-architecture role — the focus is CI/CD, infrastructure automation, deployment reliability, observability, and GPU-oriented workload scaling.
What You'll Own
Improve CI/CD pipelines, deployment workflows, and release reliability
Standardize infrastructure and deployment patterns across environments
Improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
Partner closely with backend engineering on:
deployment strategies
infrastructure automation
environment consistency
migration workflows
possible Kubernetes migration efforts
Support ML-oriented infrastructure as a secondary responsibility:
SageMaker workloads
Ray clusters
GPU scaling patterns
distributed batch execution
autoscaling behavior
runtime/image management
artifact delivery/versioning
The Kind of Problems You'll Work On
Deployment safety and rollback strategies
Infrastructure consistency across environments
Release automation and environment promotion flows
Autoscaling and runtime stability
GPU workload orchestration and scaling efficiency
Operational tooling that reduces friction for engineering teams
Stack
AWS
Terraform
Docker
Kubernetes
CI/CD systems
SageMaker
Ray
GPU compute infrastructure
You'll Probably Do Well Here If
You've operated production infrastructure at meaningful scale
You're strong in practical DevOps execution and operational reliability
You care about automation, observability, and deployment safety
You're comfortable improving developer workflows and infrastructure tooling
You've worked with distributed systems or GPU-oriented workloads before