HPC Systems Engineer

Company:
Children's National Hospital
Location:
Washington, DC, 20022
Posted:
June 30, 2025

Description:

The Developing Brain Institute is seeking a highly qualified and motivated HPC Systems Engineer to join its high-functioning, dynamic team.

The Systems Engineer contributes to the strategic planning, design, testing, organization and implementation of cutting-edge technology projects for the Institute.

The Systems Engineer is responsible for the day-to-day administration of HPC clusters, high-performance storage systems, backups, networking, security and any other services related to the operation of an HPC environment.

The successful candidate will have experience in similar roles in high performance computing (HPC) labs or university settings.

Specific Duties & Responsibilities:

Systems Engineering, Administration, Security, and Oversight:

Work with PIs and Engineers to design, organize, plan, test and implement cutting-edge hardware designs for an HPC environment.

Extensively document systems processes so staff can perform routine tasks and provide backup.

Provide stable solutions for HPC resources.

Maintain extensive monitoring systems to facilitate quick, proactive responses to routine failures, and to provide comprehensive performance data logging.

Provide general system administration backup and escalation for other staff.

Install and maintain user workstations that serve as gateways to the HPC environment.

Ensure resources meet the community's needs and are highly available to the group with limited interruption.

Manage inventory of licenses and software/hardware purchases in coordination with respective vendors.

Implement network configuration and security measures to assure effective utilization of resources.

Understand HPC technical needs.

Work closely with the Manager and team to successfully implement policies and procedures.

Create and maintain a stable, secure operating system and software environment, which continues to meet users' evolving research needs.

Implement and maintain security measures to protect data subject to restrictions.

Manage data access restrictions on a per user and group basis.

Implement and maintain monitoring measures for data and system access.

Other systems tasks as assigned by supervisor.

Technological Research:

Offer technical advice on new projects that directly involve HPC.

Develop custom tools where necessary and contribute useful creations back to open-source development efforts where appropriate.

Implement and test new technologies that could be beneficial to HPC.

Special Knowledge, Skills, & Abilities:

Proven experience deploying large-scale, complex HPC projects.

Proven experience across multiple technologies with background in applications, databases, middleware, etc.

In-depth knowledge of the design and organization of cutting-edge technology in HPC environments.

In-depth understanding of HPC Cluster hardware and management software.

Understanding of massive high performance parallel storage and methodologies.

Expert knowledge of Unix/Linux systems administration, including all aspects of management, monitoring, performance analysis, and integration in potentially complex heterogeneous environments.

Knowledge of networking, high speed interconnects, and network security principles in an HPC environment.

Use of configuration management tools (e.g. Bright) to help maintain large-scale Linux clusters, supercomputers, storage systems (50+ TB), and smaller systems.

Understand, implement, troubleshoot, and support job scheduling, resource management and workload management systems, including diagnosis of failed jobs, implementation of policies, and investigations of new features and services.

Understand and support hierarchical file system infrastructure, software and services, including high performance parallel storage, backup systems, and robotic tape libraries.

Develop reports and customize tools that automate the monitoring process of critical systems and alert team of issues automatically.

Evaluate, implement and manage appropriate high level complex software and hardware solutions by using best practices for the environment to ensure system integrity.

Install and configure infrastructure applications by following the industry's best practices to deliver effective solutions.

Maintain an effective schedule for systems backups and archive operations for mission critical systems.

Audit and maintain user access, authorization and authentication.

Generate periodic reports on resource utilization.

Maintain resource inventory using best practice applications.

Advanced knowledge of Linux, Apache, SQL, PHP/Python/Perl (LAMP) technology/toolkits.

Ability to handle high-priority escalations whenever necessary.

Ability to multitask while managing time and priorities.

Troubleshoot and solve difficult system issues as they arise.

Must be adaptable and able to meet conflicting deadlines.

Exceptional organizational skills.

Maintain effective and thorough documentation of all configuration and tasks performed.

Ability to automate systems administration tasks wherever possible.

Excellent oral and written interpersonal skills.

Ability to meet the physical requirements of the position.

Keep up to date on emerging technologies.

Research, recommend, and implement new technologies based on their value to the research team.

Ability to maintain confidentiality.

Excellent customer service skills.

Excellent communication skills.

Must demonstrate strong critical thinking and analytical reasoning.

Minimum Qualifications:

Bachelor's degree.

Five years of related experience.

Additional education may substitute for required experience and additional related experience may substitute for required education, to the extent permitted by the department.

Preferred Qualifications:

Seven (7) years of experience managing Linux servers, with direct experience managing HPC clusters.

Experience as a high-level Linux system administrator.

Experience managing mission critical services.

Familiarity with configuration of the HPC software stack, including MPI, OpenMP, Intel and GNU compilers, and math libraries.

Experience with open-source software compilation.

In-depth knowledge of TCP/IP networking and related protocols, InfiniBand, etc.

Experience with scientific application management packages such as pymodules and modules.

Excellent scripting skills in Python, Perl, and shell.

Programming skills in C, C++, or a scientific language, desired but not required.

Experience with MySQL or MariaDB database programming, desired but not required.

Expert-level knowledge of configuration management and monitoring tools (Puppet, Nagios, etc.).

Experience configuring resource manager applications (like SLURM).

Experience with Apache administration.

Knowledge of scientific software applications in academic supercomputing environments.

Familiarity or experience with data subject to restrictions, desired but not required.
