· Collaborate closely with engineering teams (system architects, hardware/software engineers, QA, and more) to design, develop, debug, and release next-generation products.
· Manage and maintain a high-performing Compute Farm of builders, packagers, testers, and core infrastructure.
· Ensure availability targets are consistently met and lead system recovery efforts.
· Deploy and qualify systems while supporting exciting new technology bring-ups.
· Oversee inventory and lifecycle management for NVIDIA's assets across data centers and labs.
· Gather critical metrics and create Standard Operating Procedure (SOP) documentation.
· Maintain a world-class, safe, and well-organized environment in our data centers and labs.
· Troubleshoot Linux/Windows, hardware, and infrastructure issues alongside engineers and platform operations teams.
· Plan, deploy, and maintain on-premises private cloud infrastructure, collaborating with datacenter and network engineering teams.
· Implement efficiency improvements to maximize availability, throughput, and test accuracy while meeting SLAs and KPIs.
· Represent the team in meetings with internal stakeholders and contribute to global operations.
What We Need to See:
· Associate’s or Bachelor’s Degree in Engineering/Technical Major (or equivalent experience).
· 5+ years of experience in data centers or large engineering labs.
· Familiarity with SCM tools such as Git or Perforce.
· Proficiency with DCIM tools (Nautobot, etc.) and scripting/automation (shell, Python, Ansible).
· Working knowledge of protocols and services such as TCP/IP, DNS, NFS, and SSL.
· Experience with Windows, Linux, and macOS operating systems.
· Hands-on experience with PCBs, GPUs, and system deployments.
· Exceptional communication skills, both written and verbal.
· Ability to explain technical concepts to non-technical audiences.
· Strong problem-solving skills and a collaborative spirit.
What Makes You Stand Out:
· Experience managing HPC clusters using tools like BCM and Slurm.
· Hands-on knowledge of OpenStack.
· Relevant certifications such as CCNA or equivalent.
· Strong background in Windows and Linux administration, with an understanding of dense datacenter design, including compute, storage, and networking.
· Experience with hypervisors and VM applications.
· Knowledge of DC infrastructure with an emphasis on liquid cooling.
· A track record of technical curiosity and innovation.
· Mechanically inclined and comfortable with tools and physical tasks.
· Energetic and enthusiastic, with an understanding of what it takes to get the team to the finish line.
· Willing to go the extra mile to get the job done!
· This is an onsite contract position and will require local travel to DCs within Santa Clara.
Additional Notes:
· Team/Culture Fit Profile:
· High level of cross-team collaboration with sister Enterprise teams (cross-team collaboration experience is a must!)
· Ability to take initiative and ownership of your work
· Strong verbal and written communication skills
· Ability to work independently, critical thinking skills, team player
· Qualifications/Key Responsibilities:
· 5+ years of experience working in a data center/lab environment
· Associate’s or Bachelor’s Degree in Engineering/Technical Major (manager prefers bachelor’s degree)
· Scripting/automation expertise
· Team focuses on early product development
· Strong coordination skills in the R&D space
· Experience managing HPC clusters (Slurm++)
· 2-3+ years of experience building scripts (Linux-based)
· Working experience with process development and driving tasks to completion
· Scrum, Agile