- Proven ability to orchestrate bare metal linux systems at scale including building automation for firmware updates, bios config management, configuring PXE environments.
- Deep Linux systems experience including troubleshooting network interfaces, developing and applying configuration management, security best practices and monitoring and alerting.
- Strong automation mindset.
- Expert knowledge in 1 or more orchestration tools such as MaaS, Salt, Chef, Ansible or Puppet.
- Strong communication skills.
- Your job will involve writing detailed documentation for others to pick up or leading knowledge sharing sessions with operations teams.
- Bonus skills include:
- Hands-on experience in High Performance Computing (HPC) clustered environments from Nvidia or AMD.
- Experience in performing automated wide scale testing on NCCL or other frameworks.
- Network engineering experience with VyOS platforms.
What You'll Be Working On:
- Provisioning and automating GPU Bare Metal Deployments
- DevOps - Assist customer support and Cloud Ops teams with GPU specific knowledge/debugging during customer escalations.
- Performance testing, analysis and monitoring
- Firmware, BIOS, Kernel Upgrades and Testing