Description

Job Description:

  • Lead and manage OpenShift platform operations (on-prem and cloud).
  • Architect and execute large-scale infrastructure projects (e.g., cluster rebuilds, expansions).
  • Collaborate with stakeholders and vendors to ensure platform reliability and scalability.
  • Develop automation tools and CI/CD pipelines to support platform and application teams.
  • Monitor and optimize cluster health, performance, and capacity across environments.
  • Implement security best practices, including RBAC, encryption, and regular patching.
  • Automate cluster provisioning, scaling, and updates using tools like Ansible, Helm, and Terraform.
  • Implement backup strategies, disaster recovery plans, and ensure high availability.
  • Diagnose and resolve cluster and application issues in a timely manner.
  • Create scripts and dashboards to improve observability and operational efficiency.
  • Provide Level 1 support and contribute to incident response and production readiness.
  • Define and promote DevOps best practices across delivery teams.
  • Support cloud operations and infrastructure initiatives across various teams.
  • Provide support for BuildMaster upgrades and maintenance.
  • Provide support for other container-based DevOps solutions (e.g., GitHub ARC).

Equipment Requirements:

  • The candidate must provide their own equipment.

Mandatory Training Courses:

  • Once hired, the candidate will be required to complete all mandatory training, which includes, but is not limited to, FOIP, Security/Cybersecurity, Information Management, and Respect in the Workplace.
  • There may also be other mandatory and/or optional training.

Must Have:

  • Experience with research, analysis and problem solving. 7 years
  • Demonstrated experience in stakeholder engagement and vendor coordination, including technical leadership in cross-functional teams. 2 years
  • Experience architecting, managing, and maintaining Azure Red Hat OpenShift clusters. 5 years
  • Experience designing and maintaining production-grade monitoring and alerting systems, with a focus on production readiness and proactive incident detection. 7 years
  • Experience documenting technical issues, architectural decisions, and operational procedures. 7 years
  • Experience installing and administering BuildMaster. 3 years
  • Experience leading infrastructure projects, such as OpenShift cluster expansions, rebuilds, and migrations. 5 years
  • Experience managing production release readiness and incident response, including go/no-go decisions, pre-deployment validation, triage, escalation, and post-incident reviews. 3 years
  • Experience providing general cloud consultancy for Azure, including best practices, cost optimization, governance, and operational support. 2 years
  • Experience with basic security awareness such as secrets management and the principle of least privilege. 5 years
  • Experience with command line interfaces for command execution and file system traversal. 7 years
  • Experience creating and maintaining BuildMaster workflows, including any configuration components. 5 years
  • Experience with IaC tools (e.g., Terraform). 3 years
  • Experience with networking tools and concepts such as DNS lookups, traceroute, and load balancing. 7 years
  • Experience with public sector or compliance-heavy environments (e.g., NIST, DoD, PCI-DSS). 3 years
  • Hands-on experience architecting, managing, and maintaining on-prem Red Hat OpenShift clusters, including lifecycle operations such as provisioning, scaling, upgrades, and recovery, with a focus on platform-level responsibilities, not application deployment. 5 years

Nice to Have:

  • Experience with disaster recovery testing. 2 years
  • Experience working in or for the public sector. 2 years
  • Strong domain knowledge in cloud operations and platform engineering. 2 years

Education:

Any Graduate