Title: Machine Learning Performance Engineer
Location: San Ramon, CA preferred with 50% travel
Duration: 6+ Months contract
Skills Required: ML + CUDA + Python
Description:
Must be willing to travel to customer sites.
Job Responsibilities: CUDA installation, configuration, and tuning issues are slowing down adoption of the technology.
These experts will help us fix these issues.
Requirements:
An understanding of modern ML techniques and toolsets
The experience and systems knowledge required to debug a training run's performance end to end
Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
Intuition about the latency and throughput characteristics of CUDA graph launch, Tensor Core arithmetic, warp-level synchronization, and asynchronous memory loads
Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters
An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools