Site Reliability Engineer

Company:

Rethink recruit

Location:

San Francisco, CA

Posted:

November 04, 2025

Description:

About Runloop

Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform eliminates friction in environment setup and dependencies, enabling teams to experiment, iterate, and deploy seamlessly. We're a small but dedicated team working to deliver a rock-solid platform that empowers innovation.

The Role

We're looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, observability, performance, and security of our core platform, the foundation upon which our users build. You'll work closely with engineering to maintain resilient systems that power our code sandboxes, while mentoring peers on reliability practices. This role blends deep operational expertise with a software engineering mindset.

What You'll Do

Design, operate, and improve production infrastructure on AWS, GCP, or Azure.

Define and monitor SLIs/SLOs, manage error budgets, and maintain observability with Prometheus, Grafana, and logging/tracing frameworks.

Build automation for deployments, scaling, and recovery, reducing toil and creating self-healing systems.

Lead incident response, root-cause analysis, and blameless post-mortems.

Collaborate with developers to design scalable, reliable services.

Optimize distributed systems, networking, and sandbox performance.

Plan for capacity growth and support safe release/change management.

Mentor engineers on reliability and front-end distributed systems (CDNs, caching, client observability).

Qualifications

Proven experience as an SRE, DevOps Engineer, or similar role.

Strong programming skills (Python or Go preferred).

Deep knowledge of containerization (Docker, Kubernetes).

Expertise in infrastructure-as-code (Terraform or Pulumi).

Strong understanding of networking, Linux, and system security.

Hands-on experience with distributed systems and observability (metrics, logs, tracing).

Skilled in incident management, on-call rotations, and post-mortem processes.

Ability to mentor and influence best practices across teams.

Bonus Points

Experience with chaos engineering, CI/CD for front-end delivery, or observability tools like Sentry, RUM, or synthetic monitoring.

Benefits

Competitive salary and equity.

Comprehensive health, dental, and vision insurance for you and your dependents.

Free lunch and snacks.

Opportunity to shape the future of AI-driven software engineering in a high-impact role.

Location

On-site in San Francisco, CA (in office 4 days/week, optional 1 day WFH).

Join Us

If you're passionate about building resilient systems that empower developers and want to shape the future of AI-driven software engineering, we'd love to hear from you. Join Runloop and help build the infrastructure that powers tomorrow's AI.

Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity, or any other characteristic protected by law.
