HPC Network Engineer - Ashburn, VA

Company:
Experis
Location:
Sterling, VA, 20146
Posted:
July 03, 2025

Description:

Responsibilities include, but are not limited to, the following:

• Designing and deploying HPC clusters consisting of high-performance servers interconnected by high-speed networks using RoCE (RDMA over Converged Ethernet).

RoCE Responsibilities

• Network Design and Configuration: Designing and configuring RoCE networks, including switches, network adapters, and Ethernet fabrics, to provide low-latency, high-bandwidth communication for RDMA traffic. Optimizing network settings such as MTU (Maximum Transmission Unit), buffer sizes, and flow control parameters to maximize RoCE performance.

• Congestion Management: Implementing congestion management mechanisms, such as Priority Flow Control (PFC) and other Data Center Bridging (DCB) features, to prevent congestion and ensure fair allocation of network resources. Monitoring network traffic and congestion levels to dynamically adjust congestion control settings and avoid performance degradation.

• Routing and Switching Optimization: Configuring RoCE-aware switches and routers to support RDMA traffic and enable efficient routing of packets between endpoints. Tuning switch port settings, forwarding tables, and routing protocols to minimize packet loss and maximize throughput for RoCE traffic.

• Performance Monitoring and Tuning: Monitoring RoCE network performance metrics, such as latency, throughput, and packet loss, using tools like Ethernet Performance Monitoring (EPM) and InfiniBand Performance Monitoring (IPM). Analyzing performance data to identify bottlenecks, optimize network configurations, and fine-tune RoCE parameters for optimal performance.

• Security and Authentication: Implementing security measures, such as MACsec (Media Access Control Security) and IPsec (Internet Protocol Security), to encrypt and authenticate RDMA traffic over RoCE networks. Enforcing access controls and certificate-based authentication to ensure secure communication between RoCE endpoints.

• Vendor Management: Coordinating with hardware and software vendors to ensure compatibility and support for products in multi-vendor environments. Developing Bills of Materials (BOMs). Clearly defining technical requirements, including performance, scalability, compatibility, and specific features needed for RoCE. Assessing the products' technical specifications, performance benchmarks, and compatibility with existing infrastructure. Implementing a proof of concept (PoC) to test the switches in a controlled environment and ensure they meet performance and reliability expectations. Evaluating the vendor’s technical support capabilities, including responsiveness, expertise, and available resources. Maintaining regular communication with the vendor to stay informed about product updates, potential issues, and upcoming changes. Scheduling periodic meetings to review performance and bugs, discuss any concerns, and plan for future needs.

Recommended Qualifications

• Bachelor’s Degree in Computer Science, Information Technology, or related field: A solid educational foundation in computer science or IT is essential for understanding networking principles and protocols.

• Proficiency in RoCE (RDMA over Converged Ethernet) protocols, including RoCEv2 and related standards.

• Experience in designing and configuring high-performance networks utilizing RoCE-enabled Ethernet networks. Knowledge of fabric design principles, topology optimization, and performance tuning techniques.

• Ability to analyze network performance metrics, diagnose bottlenecks, and optimize network configurations for low latency and high throughput. Experience in tuning switch port settings, buffer sizes, and flow control parameters to maximize RoCE performance.

• Familiarity with security measures for RoCE networks, including subnet partitioning, encryption, and access controls. Knowledge of authentication mechanisms and cryptographic protocols for securing RDMA traffic.

• Proficiency in network monitoring tools and techniques for monitoring RoCE network health and performance. Ability to troubleshoot network issues, diagnose connectivity problems, and resolve performance-related issues.

• Vendor certifications in RoCE-related technologies.

• Hands-on experience in deploying, managing, and optimizing high-performance computing (HPC) environments and data center networks. Experience working with RDMA-enabled applications and parallel computing frameworks (e.g., MPI, OpenMP).

• Experience in implementing and troubleshooting complex network configurations, including switches, gateways, and RoCE adapters.

Additional Nice-to-Haves:

• Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.

• CCIE or similar certification

• Ability to work efficiently on multiple projects and under pressure

• Previous experience with network equipment vendor products (e.g., Juniper, Cisco, Arista, OEM).

• Working knowledge of stateful and stateless firewalls

• Comfortable with Linux or other UNIX implementations, with scripting skills.

• Experience with Python scripting and Ansible for automation

• Ability to “read code” as source documentation

• DevOps CI/CD mindset for automation and scale
