Site Reliability Engineer - GPU Infrastructure

Company: Genmo
Location: San Francisco, CA, 94118
Posted: September 23, 2025

Description:

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI.

Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You'll Do

* Own the design and day-to-day operation of GPU clusters that train and serve frontier generative models.

* Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation.

* Define and implement Infrastructure as Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

* Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

* Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

* Optimize high-performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

* Run and continuously improve the 24/7 on-call rotation; lead post-incident reviews.

* Partner with researchers and engineers, communicate crisply, and ship with a high-ownership mindset.

Minimum Qualifications

* BS/MS/PhD in CS, EE, or related field.

* 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

* Expert-level Kubernetes experience.

* Hands-on with containerized GPU stacks (NVIDIA Container Toolkit, GPU Operator).

* Hands-on with GPU schedulers such as Slurm or Kueue.

* Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

* Track record of shipping and operating large-scale infrastructure with high reliability and clear communication.

Nice to Have

* Multi-cluster / multi-cloud (AWS, GCP, Azure, bare metal) production experience.

* Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

* Prior work with distributed training, model serving patterns, or other ML/GPU workloads.

Machine learning depth is a plus, not a prerequisite. We'll help you level up if needed.

Genmo is an Equal Opportunity Employer.

Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law.

Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
