MLOps / AI Infrastructure Engineer

Company:
Brillfy Technology Inc.
Location:
Irving, TX
Posted:
April 08, 2026

Description:

Job Title: MLOps / AI Infrastructure Engineer

Location: Remote (periodic travel to client sites, primarily government/on-prem environments)

Duration: Full-Time / Permanent

Work Hours: Standard business hours (flexible, async-first environment)

Job Overview & Description:

We are seeking a highly skilled MLOps / AI Infrastructure Engineer to design, deploy, and operate mission-critical AI infrastructure in on-premises and edge environments. This role focuses on building and maintaining GPU-powered compute clusters, Kubernetes-based orchestration systems, and scalable MLOps pipelines for AI training and inference workloads.

The ideal candidate will have deep expertise in GPU infrastructure (NVIDIA H200/A100), Kubernetes on bare metal, high-performance networking, and software-defined storage, along with hands-on experience deploying secure, compliant systems in air-gapped and government-regulated environments.

You will collaborate closely with AI/ML engineers, architects, and client technical teams to ensure high availability, performance, and compliance of AI platforms operating outside traditional cloud environments.

Roles & Responsibilities:

GPU Compute & Infrastructure

• Deploy, configure, and maintain GPU servers (NVIDIA H200, A100)

• Manage CUDA, drivers, firmware, NVLink/NVSwitch topology

• Implement NVIDIA tools (DCGM, MIG, NVIDIA Container Toolkit)

• Monitor hardware health, utilization, and performance (see the sketch after this list)

• Automate bare-metal provisioning (PXE/iPXE, MAAS, Foreman)
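
For a flavor of the hands-on work, below is a minimal health-polling sketch using the NVML Python bindings (pynvml, from the nvidia-ml-py package). In practice a cluster at this scale would more likely export the same signals through DCGM and Prometheus; the script only illustrates the metrics being watched.

```python
# Minimal GPU health poll via NVML (pip install nvidia-ml-py).
# Production clusters would typically scrape dcgm-exporter instead;
# this sketch just shows the raw signals involved.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(h)  # % of time GPU / memory were busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)         # bytes used / total
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU{i}: util={util.gpu}% "
              f"mem={mem.used // 2**20}/{mem.total // 2**20} MiB temp={temp}C")
finally:
    pynvml.nvmlShutdown()
```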

Kubernetes & Container Orchestration

• Build and manage Kubernetes clusters (kubeadm / Rancher RKE2)

• Configure GPU node pools using the NVIDIA GPU Operator (see the sketch after this list)

• Implement CNI solutions (Calico, Cilium, SR-IOV)

• Manage ingress, load balancing (MetalLB), and service mesh (Istio/Linkerd)

• Enforce cluster security (RBAC, network policies, secrets management)
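
As a concrete example of a day-two check in this area, the sketch below uses the official Kubernetes Python client to confirm that the GPU Operator has advertised nvidia.com/gpu as an allocatable resource on each node. Kubeconfig access to the cluster is assumed.

```python
# List allocatable GPUs per node (pip install kubernetes).
# Assumes a working kubeconfig; inside a pod, use load_incluster_config().
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    # The GPU Operator's device plugin publishes this extended resource.
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```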

MLOps & AI Workloads

• Deploy ML platforms (MLflow, Kubeflow)

• Manage model serving using NVIDIA Triton Inference Server (illustrated after this list)

• Build CI/CD pipelines using a GitOps approach (ArgoCD, Flux)

• Optimize GPU utilization for training and inference

• Manage model storage (Ceph, MinIO)
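
To illustrate the serving side, here is a hedged sketch of a client-side call against Triton's HTTP endpoint (pip install tritonclient[http]). The model name and the input__0/output__0 tensor names are placeholders; the real names come from the deployed model's config.pbtxt.

```python
# Hypothetical inference request against a Triton HTTP endpoint.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Tensor names/shapes are placeholders typical of an image model.
inp = httpclient.InferInput("input__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("output__0")

result = triton.infer(model_name="resnet50", inputs=[inp], outputs=[out])
print(result.as_numpy("output__0").shape)
```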

Networking & Storage

• Design high-bandwidth networking (InfiniBand, RoCE v2, Ethernet)

• Configure RDMA for distributed AI workloads

• Deploy software-defined storage (Ceph, Rook, MinIO; see the sketch after this list)

• Implement VLANs, firewall policies, and secure connectivity (VPN)
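
As a small illustration on the storage side, the sketch below pushes a model artifact into a MinIO bucket with the official Python SDK (pip install minio). The endpoint, credentials, bucket, and object path are all placeholder assumptions.

```python
# Upload a serialized model to S3-compatible object storage via the MinIO SDK.
from minio import Minio

s3 = Minio(
    "minio.internal:9000",    # placeholder in-cluster endpoint
    access_key="ACCESS_KEY",  # placeholder credentials
    secret_key="SECRET_KEY",
    secure=False,             # lab setting; use TLS in production
)

bucket = "model-artifacts"
if not s3.bucket_exists(bucket):
    s3.make_bucket(bucket)

# Triton or MLflow can later pull the artifact from the same bucket.
s3.fput_object(bucket, "resnet50/1/model.onnx", "model.onnx")
```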

Security & Compliance

• Implement controls aligned with NIST SP 800-171 / CMMC

• Maintain OS hardening against CIS benchmarks (RHEL/Rocky Linux, Ubuntu)

• Automate compliance checks with OpenSCAP (see the example after this list)

• Document infrastructure (SSP, diagrams, DR plans)

• Support audits and penetration testing remediation
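
To show what automated compliance checking can look like, the sketch below drives an OpenSCAP scan from Python by shelling out to the oscap CLI. The CIS profile ID and datastream path are assumptions; the SCAP Security Guide content installed on the host determines the real values.

```python
# Run an OpenSCAP CIS scan and collect the results/report files.
import subprocess

cmd = [
    "oscap", "xccdf", "eval",
    "--profile", "xccdf_org.ssgproject.content_profile_cis",  # assumed profile ID
    "--results", "results.xml",
    "--report", "report.html",
    "/usr/share/xml/scap/ssg/content/ssg-rl9-ds.xml",  # assumed datastream path
]
proc = subprocess.run(cmd, capture_output=True, text=True)
# oscap exits 0 when all rules pass and 2 when some fail;
# both outcomes produce usable results/report files.
print(proc.stdout)
```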

Required Qualifications:

• 6+ years of infrastructure engineering experience

• 3+ years managing GPU clusters or HPC environments

• Strong expertise in NVIDIA GPU stack (CUDA, DCGM, MIG, NVLink)

• Hands-on Kubernetes (bare-metal deployment & operations)

• Strong networking knowledge (BGP, VLANs, RDMA, load balancing)

• Experience with storage solutions (Ceph, MinIO, Rook)

• MLOps experience (MLflow, Kubeflow, Triton, GitOps pipelines)

• Knowledge of NIST SP 800-171 compliance

• Experience with Terraform or Ansible

• Strong Linux administration skills

• Excellent documentation and communication skills

Certifications (Preferred / Required):

• Certified Kubernetes Administrator (CKA) / Certified Kubernetes Security Specialist (CKS)

• NVIDIA Certifications (GPU / AI Infrastructure)

• RHCSA / RHCE (Red Hat Certified System Administrator/Engineer)

• Security or compliance certifications (preferred)
