
Machine Learning - Model Serving

Company:
Alexander Chapman
Location:
San Francisco, CA
Posted:
June 25, 2025

Description:

We are working with a company building intuitive, voice-first AI systems that blend natural interaction with powerful model performance. Founded by leaders from Meta, Oculus, and Google, they’re creating a new class of consumer devices powered by speech, vision, and LLMs.

The Role

You’ll help optimize and scale the inference stack, working across model serving, performance tuning, and deployment to support real-time, multimodal AI.

What You’ll Do

Improve serving systems for LLMs, speech, and vision models.

Optimize throughput, latency, and cost using advanced techniques like batching, caching, and kernel tuning.

Extend frameworks like vLLM or SGLang to push the limits of serving performance.

Collaborate with training teams to deploy faster, lighter models.

Experiment with compilers and hardware backends to boost efficiency.

What We’re Looking For

Strong experience with PyTorch or similar ML frameworks.

Deep knowledge of model serving and systems performance.

Skilled in low-level debugging, bottleneck analysis, and server optimization.

Familiar with vLLM, Ray, and deploying inference workloads at scale.

Comfortable owning complex infrastructure projects end to end.

Background in computer science or a related field from a top-tier university (e.g., Stanford, MIT, Ivy League).

Experience at a top tech company (e.g., FAANG) or a successful, high-growth startup.

They’re looking for curious, impact-driven engineers ready to push what’s possible with real-time AI.

Apply