Cluster Engineer - Deep Learning

Company:
Sustainable Talent
Location:
Santa Clara, CA
Posted:
April 13, 2024

Description:

Are you ready to make your mark at the forefront of technological innovation? As an HPC Cluster Engineer, you'll play a pivotal role in shaping the future of AI, deep learning, and machine learning initiatives. Join us and leverage Nvidia's cutting-edge GPU technology to drive groundbreaking discoveries and revolutionize industries.

Sustainable Talent is thrilled to partner with Nvidia, a global powerhouse with over 25 years of trailblazing advancements in computer graphics, gaming, and accelerated computing.

This is a full-time W-2 contract position based in Santa Clara, CA, with a hybrid work option. We offer competitive pay based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!

Additional locations: Westford, MA; Durham, NC; Austin, TX.

What you'll be doing:

You'll lead the charge in optimizing our InfiniBand network and managing Lustre and GPFS storage solutions, ensuring seamless performance for our cutting-edge initiatives.

Your expertise in the SLURM job scheduler will be instrumental in orchestrating the smooth operation of our clusters, from scheduling tasks to managing resources efficiently.

As a Linux sysadmin guru, you'll be responsible for maintaining the stability and security of our systems, leveraging your deep understanding of Linux environments.

Harnessing the power of Ansible, you'll automate routine tasks and streamline operations, freeing up time for innovation and optimization.

Advanced Python and Bash scripting will drive automation efforts and enable dynamic solutions to complex challenges.

What We Need to See:

Demonstrated experience with SLURM, coupled with a solid understanding of InfiniBand networks and Lustre/GPFS storage systems, is essential.

A proven track record in Linux system administration, ensuring robustness and security in our computing environment.

Proficiency in Ansible is a must-have, enabling you to automate tasks and workflows efficiently.

Strong scripting abilities in Python and Bash are critical for developing custom solutions and optimizing cluster performance.

Ways to Stand Out From the Crowd:

Showcase your knowledge of best practices in HPC cluster operations, automation, and upgrades, setting you apart as a seasoned professional in the field.

Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.
