Summary:
Join Meta's rapidly expanding AI Training and Inference Infrastructure team! As we tackle the scaling challenges of supporting diverse AI use cases, your expertise will play a crucial role in evolving the network infrastructure that interconnects large numbers of training accelerators such as GPUs. You will ensure our network meets high performance and availability standards for RDMA workloads, which demand a lossless fabric with minimal latency. We constantly seek to improve performance across the stack, from the network fabric to host networking, communication libraries, and scheduling infrastructure.
Your Responsibilities:
Lead interdisciplinary teams to develop innovative solutions for large-scale training systems, weighing trade-offs and making informed decisions.
Ensure on-time delivery of milestones through close collaboration across teams.
Oversee the overall performance of our communication system, including benchmarking, monitoring, and troubleshooting production issues.
Define the technical vision and drive a multi-year roadmap to achieve our objectives.
Collaborate with cross-functional teams, providing insights on AI network architecture, including topologies, transport, and congestion control techniques.
Minimum Qualifications:
Bachelor's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent experience.
Experience in developing, evaluating, and debugging host networking protocols such as RDMA.
Over 10 years of experience designing, deploying, and operating complex networks.
Proven ability to triage performance issues in large-scale distributed applications.
Preferred Qualifications:
Experience developing communication libraries such as MPI (Message Passing Interface) implementations, NCCL, and UCX.
Strong understanding of AI training workloads and their network demands.
Knowledge of RDMA congestion control mechanisms for InfiniBand and RoCE networks.
Familiarity with the latest AI technologies.
Experience with machine learning frameworks such as PyTorch and TensorFlow.
Proficiency in system software development using languages like C++.
Compensation:
$219,000/year to $301,000/year + bonus + equity + benefits
Industry: Internet
Equal Opportunity:
Meta is an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based on race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, transgender status, age, veteran status, disability status, or other legally protected characteristics. We consider qualified applicants with criminal histories in accordance with applicable laws. Meta participates in the E-Verify program where required by law. Meta may utilize AI and machine learning technologies during the hiring process.
Meta is dedicated to providing reasonable accommodations to candidates with disabilities during recruitment. If you require assistance or accommodations due to a disability, please let us know.