Machine Learning Engineer – Speech & Audio AI
Location: San Francisco, CA (Hybrid)
Employment Type: Full-time
Experience Level: Mid to Senior
Are you passionate about shaping the future of voice and sound technology? Join a cutting-edge AI startup in San Francisco that’s building the next generation of speech and audio intelligence products.
We’re looking for a Machine Learning Engineer who enjoys solving complex problems and working across multiple areas of AI and data-driven technology in a dynamic environment.
What You’ll Do
Design, train, and optimize ML models for speech recognition, audio classification, speaker diarization, or text-to-speech (TTS).
Collaborate with product and research teams to bring state-of-the-art models into production.
Develop scalable pipelines for model training, evaluation, and deployment.
Apply techniques like self-supervised learning, transformers, or diffusion models to real-world audio data.
Analyze and clean large-scale voice datasets (structured and unstructured).
Monitor and improve inference performance in real-time audio systems.
What We’re Looking For
2–6 years of experience in machine learning, with a focus on speech/audio.
Strong background in deep learning (PyTorch or TensorFlow).
Hands-on experience with tools and frameworks such as:
Hugging Face Transformers
torchaudio, librosa, Kaldi, ESPnet
Neural vocoders (e.g., WaveGlow, WaveNet, HiFi-GAN)
Voice conversion frameworks (e.g., RVC, DiffVC, YourTTS)
TTS engines like Coqui TTS
Self-supervised learning tools like S3PRL
Solid understanding of digital signal processing and acoustic modeling, with experience in FFmpeg, SoX, NumPy/SciPy, and Praat.
Experience deploying ML models in cloud environments (AWS, GCP, or Azure).
BS or MS in CS, EE, ML, or a related field (or equivalent industry experience).