Description

· Collaborate closely with engineering teams (system architects, hardware/software engineers, QA, and more) to design, develop, debug, and release next-generation products.

· Manage and maintain a high-performing Compute Farm of builders, packagers, testers, and core infrastructure.

· Ensure availability targets are consistently met and lead system recovery efforts.

· Deploy and qualify systems while supporting exciting new technology bring-ups.

· Oversee inventory and lifecycle management for NVIDIA's assets across data centers and labs.

· Gather critical metrics and create Standard Operating Procedure (SOP) documentation.

· Maintain a world-class, safe, and well-organized environment in our data centers and labs.

· Troubleshoot Linux/Windows, hardware, and infrastructure issues alongside engineers and platform operations teams.

· Plan, deploy, and maintain on-premises private cloud infrastructure, collaborating with datacenter and network engineering teams.

· Implement efficiency improvements to maximize availability, throughput, and test accuracy while meeting SLAs and KPIs.

· Represent the team in meetings with internal stakeholders and contribute to global operations.
 

What We Need to See:

· Associate’s or Bachelor’s Degree in Engineering/Technical Major (or equivalent experience).

· 5+ years of experience in data centers or large engineering labs.

· Familiarity with SCM tools such as Git and Perforce.

· Proficiency in DCIM tools (such as Nautobot) and scripting (shell, Python, Ansible).

· Working knowledge of protocols and services such as TCP/IP, DNS, NFS, and SSL.

· Experience with Windows, Linux, and macOS operating systems.

· Hands-on experience with PCBs, GPUs, and system deployments.

· Exceptional communication skills, both written and verbal.

· Ability to explain technical concepts to non-technical audiences.

· Strong problem-solving skills and a collaborative spirit.
 

What Makes You Stand Out:

· Experience managing HPC clusters using tools like BCM and Slurm.

· Hands-on knowledge of OpenStack.

· Relevant certifications such as CCNA or equivalent.

· Strong background in Windows and Linux administration, with an understanding of dense datacenter design, including compute, storage, and networking.

· Experience with hypervisors and VM applications.

· Knowledge of DC infrastructure with an emphasis on liquid cooling.

· A track record of technical curiosity and innovation.

· Mechanically inclined and comfortable with tools and physical tasks.

· Energy, enthusiasm, and an understanding of what it takes to get the team to the finish line.

· Willing to go the extra mile to get the job done! 

· This is an onsite contract position that requires local travel to data centers (DCs) within Santa Clara.


Additional Notes:
 

· Team/Culture Fit Profile:

· High level of cross-team collaboration with sister Enterprise teams (cross-team collaboration experience is a must!)

· Ability to take initiative and ownership of your work

· Strong verbal and written communication skills

· Ability to work independently and as a team player, with strong critical thinking skills

 

· Qualifications/Key Responsibilities:

· 5+ years of experience working in a data center/lab environment

· Associate’s or Bachelor’s Degree in Engineering/Technical Major (manager prefers bachelor’s degree)

· Scripting/automation expertise

· Team focuses on early product development

· Strong coordination skills in the R&D space

· Experience managing HPC clusters (Slurm++)

· 2-3+ years of scripting experience (Linux-based)

· Working experience with process development and driving tasks to completion

· Experience with Scrum and Agile

 


 

Education

Bachelor's degree