Job Description:
- Lead and manage OpenShift platform operations (on-prem and cloud).
- Architect and execute large-scale infrastructure projects (e.g., cluster rebuilds, expansions).
- Collaborate with stakeholders and vendors to ensure platform reliability and scalability.
- Develop automation tools and CI/CD pipelines to support platform and application teams.
- Monitor and optimize cluster health, performance, and capacity across environments.
- Implement security best practices, including RBAC, encryption, and regular patching.
- Automate cluster provisioning, scaling, and updates using tools like Ansible, Helm, and Terraform.
- Implement backup strategies and disaster recovery plans, and ensure high availability.
- Diagnose and resolve cluster and application issues in a timely manner.
- Create scripts and dashboards to improve observability and operational efficiency.
- Provide Level 1 support and contribute to incident response and production readiness.
- Define and promote DevOps best practices across delivery teams.
- Support cloud operations and infrastructure initiatives across various teams.
- Provide support for BuildMaster upgrades and maintenance.
- Provide support for other container-based DevOps solutions (e.g., GitHub ARC).
Equipment Requirements:
- The candidate must provide their own equipment.
Mandatory Training Courses:
- Once hired, the candidate will be required to complete all mandatory training, which includes but is not limited to FOIP, Security/Cybersecurity, Information Management, and Respect in the Workplace.
- There may also be other mandatory and/or optional training.
Must Have:
- Experience with research, analysis and problem solving. 7 years
- Demonstrated experience in stakeholder engagement and vendor coordination, including technical leadership in cross-functional teams. 2 years
- Experience architecting, managing, and maintaining Azure Red Hat OpenShift clusters. 5 years
- Experience designing and maintaining production-grade monitoring and alerting systems, with a focus on production readiness and proactive incident detection. 7 years
- Experience documenting technical issues, architectural decisions, and operational procedures. 7 years
- Experience installing and administering BuildMaster. 3 years
- Experience leading infrastructure projects, such as OpenShift cluster expansions, rebuilds, and migrations. 5 years
- Experience managing production release readiness and incident response, including go/no-go decisions, pre-deployment validation, triage, escalation, and post-incident reviews. 3 years
- Experience providing general cloud consultancy for Azure, including best practices, cost optimization, governance, and operational support. 2 years
- Experience with basic security awareness such as secrets management and the principle of least privilege. 5 years
- Experience with command line interfaces for command execution and file system traversal. 7 years
- Experience with creating and maintaining BuildMaster workflows including any configuration components. 5 years
- Experience with IaC tools (e.g., Terraform). 3 years
- Experience with networking tools and concepts such as DNS lookups, traceroutes, and load balancing. 7 years
- Experience with public sector or compliance-heavy environments (e.g., NIST, DoD, PCI-DSS). 3 years
- Hands-on experience architecting, managing, and maintaining on-prem Red Hat OpenShift clusters, including lifecycle operations such as provisioning, scaling, upgrades, and recovery, with a focus on platform-level responsibilities, not application deployment. 5 years
Nice to Have:
- Experience with disaster recovery testing. 2 years
- Experience working in or for the public sector. 2 years
- Strong domain knowledge in cloud operations and platform engineering. 2 years