Title: High Performance Computing (HPC) administrator - Remote
Length: Long term
Restriction: W2 or C2C
Webcam interview; *** Long term project *** Remote; due to security only US Citizens
high performance computing (HPC) administrator to support our machine learning (ML) platforms and ecosystems. This candidate would support a product team including data scientists and machine learning engineers to support ML products and services for our Manufacturing and Engineering customers. The role includes the running and maintenance of the current environment including incident and change management as well as capacity and release management associated with the development and execution of the ML solutions. A successful candidate will have a solid understanding of operation systems and databases as well as some exposure with HPC environments utilizing graphics processing units (GPU). An understanding of basic programing skills covering UNIX Shell scripting and Python is also required.
Assists in the day-to-day operations including incident management working w/the team to resolve the associated issues with the infrastructure and systems.
Develop and maintain automation and orchestration software and scripting to assist with Machine Learning code release management.
Prioritize and efficiently manage deployment and configuration tasks
Support of the containerized ecosystem
Workflow processes for
Data ingest (Python)
API provisioning service
Watchdog jobs monitoring for data and performing data ingest
Support of the hardware specific configurations, tuning, GPU support
Experience with enterprise standard Database systems (i.e. Oracle, Microsoft SQL Server) Experience at working both independently and in a team-oriented, collaborative environment is essential
Functional understanding of networking concepts; routing, switching, firewalls, load balancers, proxy services, & protocols (TCP, UPD, HTTP, TLS, SIP, SMTP, SNMP, LDAP)
Customer service focused with excellent communication (written and verbal) and interpersonal skills. Must be able to effectively work with customers, coworkers, vendors and management
Experience in Python, PL/SQL
Experience in Unix shell scripting, PowerShell, and Bash
Working knowledge of containers and/or orchestration platforms (i.e. Docker, Singularity, Kubernetes, Rancher)
Prior application development experience using tools such as Jenkins, Gradle/Ant, SVN/Git, Artifactory, Automation
Understanding of computer HW and architecture specifically high performance compute (HPC) utilizing GPUs
Familiarity with agile methodology, ideally Scaled Agile Framework (SAFe)