Platform Engineer, MLOps - San Francisco, CA (Hybrid)

Company:
Jobgether
Location:
San Francisco, CA
Posted:
May 14, 2025

Description:

About Jobgether

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

One of our partner companies is currently looking for a Platform Engineer, MLOps in San Francisco, CA.

This role is focused on building and maintaining the infrastructure that powers AI/ML development and production environments. You’ll collaborate closely with machine learning engineers and researchers to implement robust CI/CD pipelines, oversee container orchestration systems, and optimize large-scale training and inference workloads. Your work will ensure operational excellence, improve system reliability, and scale Kubernetes clusters supporting GPU-intensive tasks. This is a high-impact position for someone passionate about automating ML workflows and solving infrastructure challenges in fast-paced, high-growth environments.

Accountabilities

Develop and manage CI/CD pipelines that support safe, reproducible machine learning experiments

Set up and monitor logging, alerting, and observability systems for model training and production APIs

Operate and optimize large Kubernetes clusters for GPU workloads

Manage containerization using Docker and orchestrate deployments via Kubernetes

Ensure high availability of training environments across distributed systems

Support and enhance the performance, scalability, and security of MLOps infrastructure

Troubleshoot complex systems and contribute to reliability improvements across ML platforms

Requirements

5+ years of experience building and managing core infrastructure for large-scale systems

Deep hands-on experience with Kubernetes, Docker, and GPU workload orchestration

Expertise in cloud platforms (GCP, AWS, or Azure) and infrastructure-as-code tools (Terraform)

Proficiency with scripting (Python, Bash) and Git/GitHub workflows

Familiarity with ML frameworks like PyTorch, Hugging Face Transformers, TensorRT, and vLLM

Experience with monitoring tools such as Prometheus, Grafana, or equivalent

Strong problem-solving skills and the ability to operate in a dynamic, ambiguous environment

Experience running inference clusters and managing CI/CD pipelines in ML-focused environments

Benefits

Generous paid time off and company holidays

Comprehensive medical, dental, and vision insurance for employees and dependents

12 weeks of paid parental leave

Fertility and family planning support

Early cancer detection screenings through Galleri

Flexible spending accounts (FSA), dependent care FSA, and HSA with company contributions

Annual stipends for home office setup, phone/internet, wellness, and learning & development

Company-wide and team off-site events

Competitive salary, stock options, and 401(k) plan

Jobgether hiring process disclaimer

This job is posted on behalf of one of our partner companies. If you choose to apply, your application will go through our AI-powered 3-step screening process, where we automatically select the 5 best candidates.

Our AI thoroughly analyzes every line of your CV and LinkedIn profile to assess your fit for the role, evaluating each experience in detail. When needed, our team may also conduct a manual review to ensure only the most relevant candidates are considered.

Our process is fair, unbiased, and based solely on qualifications and relevance to the job. Only the best-matching candidates will be selected for the next round.

If you are among the top 5 candidates, you will be notified within 7 days.

If you do not receive feedback within 7 days, it means you were not selected. However, if you wish, we may consider your profile for other similar opportunities that better match your experience.

Thank you for your interest!

Full-time

Hybrid remote
