Job Description
Benefits:
401(k)
401(k) matching
Competitive salary
Paid time off
Parental leave
Job Title: Senior MLOps Engineer
About XRI
XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced language technology solutions.
About the role
We're looking for the engineer who will own our entire AI backend stack: design decisions, uptime, roadmap, and results. You'll inherit a production Kubernetes inference cluster (GPU-backed AKS), a Python/Flask API that handles authentication, Stripe billing, and API-key metering, and a growing ML R&D pipeline for new models. The mandate is broad: keep everything humming today while architecting what it should look like 12 months from now. We'll give you budget, support, and plenty of autonomy, then count on you to drive impact.
Responsibilities
Maintain and scale AI backend (AKS, PyTriton, Flask APIs).
Develop APIs with auth, billing, and usage tracking.
Enable distributed training on Azure ML with PyTorch DDP/FSDP.
Build tools for research experimentation and internal efficiency.
Implement CI/CD with GitHub Actions, Docker, and Terraform.
Lead observability with Prometheus, Grafana, and SLOs.
Drive architecture decisions and roadmap execution.
Qualifications
Master's degree in CS, ML, or related field.
5+ years of experience developing backend infrastructure.
Proven expertise with Azure Kubernetes Service (AKS) and backend infrastructure.
Demonstrated proficiency in PyTorch, Python, Flask, Stripe integration, and RESTful APIs.
Skills
Advanced Python (3.10+), with production experience using Flask or FastAPI.
Experience with CI/CD pipelines using GitHub Actions and Docker.
Familiarity with monitoring tools (Prometheus, Grafana).
Working knowledge of Terraform, API security best practices, and OpenAPI specifications.
Bonus: Experience with Redis, ReactJS, TypeScript, and Kafka.
Goals and Objectives
Short-term: Improve ML systems, set data standards, stabilize backend, and build CI/CD.
Long-term: Develop scalable infrastructure, crowd-sourced data systems, and support internal adoption.
Collaboration
Reports to the Head of ML; works with PMs and the broader engineering team.
Mentors colleagues and engages in cross-functional teams.
KPIs
Inference uptime, cost-efficiency, and latency.
Tool adoption, incident resolution, and observability metrics.
Language & Travel
Fluent in English; additional languages a plus. Willing to travel internationally for field use, offsites, and events.
Culture and Values
XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.
Full-time
Fully remote