Overview
We are seeking an experienced Senior Systems Engineer with an active US Government Top Secret/SCI security clearance with Polygraph to support a small standalone system dedicated to high-performance computing (HPC) and artificial intelligence (AI) workloads. This role demands a blend of operational expertise and strategic technical vision, focusing on the management and optimization of that system. The ideal candidate will manage the technical operation of our infrastructure, develop standardized procedures for hardware, network, and software management across the system, and oversee cluster management with tools such as NVIDIA BCM, including provisioning, optimization, and monitoring of clustered resources for HPC/AI workloads.
What will you do?
This position requires broad expertise in HPC/AI system administration, with a focus on:
Refining infrastructure management frameworks
Traditional infrastructure management (hardware, networking, directory services)
Modern HPC/AI support (Linux/Ubuntu, Proxmox, NVIDIA BCM, WEKA storage)
Designing scalable, secure, and highly available system architectures
Do you have what it takes?
Active TS/SCI with Polygraph required.
Bachelor’s degree in Engineering, Computer Science, Software Engineering, or related field.
7+ years’ experience in systems engineering or a related field.
Operating Systems & Infrastructure:
Expert-level Linux systems engineering
Windows client operating systems deployment/maintenance
Linux (Ubuntu) server operating systems deployment/maintenance
Hardware & Networking:
Server hardware
Network hardware, wiring, and switching configurations
Virtualization & Containerization:
Virtualization (ideally Proxmox)
Containerization (ideally Docker/Podman with Ray or Kubernetes)
Management & Orchestration:
Directory services and PKI infrastructure deployment/maintenance
Configuration management (ideally Ansible, Puppet, Chef, or PowerShell DSC)
Cluster orchestration (ideally NVIDIA Base Command Manager (BCM))
Development Support & Software Management:
Development support services (GitLab, Jenkins, Nexus)
Operating system software repository synchronization (APT, Snap, YUM)