Senior MLOps Engineer

Company:
XRI GLOBAL
Location:
Clarksville, TN, 37040
Pay:
$130,000 - $170,000 USD per year
Posted:
May 22, 2025

Description:

Job Description

Benefits:

401(k)

401(k) matching

Competitive salary

Paid time off

Parental leave

Job Title: Senior MLOps Engineer

About XRI

XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced language technology solutions.

About the role

We're looking for the engineer who will own our entire AI backend stack: design decisions, uptime, roadmap, and results. You'll inherit a production Kubernetes inference cluster (GPU-backed AKS), a Python/Flask API that handles authentication, Stripe billing, and API-key metering, and a growing ML R&D pipeline for new models. The mandate is broad: keep everything humming today while architecting what it should look like 12 months from now. We'll give you budget, support, and plenty of autonomy, then count on you to drive impact.

Responsibilities

Maintain and scale the AI backend (AKS, PyTriton, Flask APIs).

Develop APIs with auth, billing, and usage tracking.

Enable distributed training on Azure ML with PyTorch DDP/FSDP.

Build tools for research experimentation and internal efficiency.

Implement CI/CD with GitHub Actions, Docker, and Terraform.

Lead observability with Prometheus, Grafana, and SLOs.

Drive architecture decisions and roadmap execution.

Qualifications

Master's degree in CS, ML, or a related field.

5+ years of experience developing backend infrastructure.

Proven expertise with Azure Kubernetes Service (AKS).

Expert in backend infrastructure and Azure Kubernetes.

Demonstrated proficiency in PyTorch, Python, Flask, Stripe integration, and RESTful APIs.

Skills

Advanced Python (3.10+), with production experience using Flask or FastAPI.

Experience with CI/CD pipelines using GitHub Actions and Docker.

Familiarity with monitoring tools (Prometheus, Grafana).

Working knowledge of Terraform, API security best practices, and OpenAPI specifications.

Bonus: Experience with Redis, ReactJS, TypeScript, and Kafka.

Goals and Objectives

Short-term: Improve ML systems, set data standards, stabilize backend, and build CI/CD.

Long-term: Develop scalable infrastructure, crowd-sourced data systems, and support internal adoption.

Collaboration

Reports to the Head of ML; works with PMs and the broader engineering team.

Mentors colleagues and engages with cross-functional teams.

KPIs

Inference uptime, cost-efficiency, and latency.

Tool adoption, incident resolution, and observability metrics.

Language & Travel

Fluent in English; additional languages a plus. Willing to travel internationally for field use, offsites, and events.

Culture and Values

XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.

Full-time

Fully remote
