Site Reliability Engineer

Work with a variety of tools and technologies to ensure the reliability and performance of the Managed Services organization.
Drive initiatives to optimize and secure infrastructure operations while contributing to scalability and continuous improvement efforts.
Maintain and ensure operational excellence of systems and applications supporting Managed Services infrastructure.
Troubleshoot and resolve issues in a fast-paced, distributed environment, focusing on root cause analysis and post-incident reviews.
Build and maintain observability frameworks using tools like Prometheus, OpenTelemetry, and Dynatrace to ensure reliable monitoring of systems and applications.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and manage error budgets to balance reliability and innovation.
Use automation tools like Ansible and scripting languages like Python to reduce operational toil and optimize processes.
Manage and troubleshoot Kubernetes clusters and containerized environments, ensuring smooth service-to-service communication.
Diagnose and resolve networking issues across OSI layers 1-3, including through packet capture analysis.
Oversee PKI certificate management and utilize tools like HashiCorp Vault for secrets management.
Collaborate with cross-functional teams to identify system improvements and resolve critical issues.
Provide mentoring and guidance to junior engineers, ensuring knowledge sharing and professional growth.

Qualifications:
Core Technical Skills:

Proficiency in Linux administration (system tuning, SSH, log analysis with grep and regular expressions).
Strong understanding of networking protocols and troubleshooting (OSI layers 1-3).
Hands-on experience with Kubernetes and container orchestration.
Automation experience with Ansible and scripting proficiency in Python.
Knowledge of PKI certificate management and HashiCorp Vault or similar tools for secrets management.
Expertise in monitoring and observability tools like Prometheus, Grafana, or Dynatrace.

Reliability Engineering Skills:

Experience defining and managing SLIs, SLOs, and Error Budgets.
Proven ability to instrument and analyze system performance metrics.
Deep familiarity with troubleshooting distributed systems and microservices architecture.

Soft Skills:

Strong initiative and curiosity for deep-diving into complex issues.
Excellent communication skills to convey solutions and collaborate across teams.
Ability to prioritize and resolve competing demands in high-pressure situations.

Preferred:

Experience with observability frameworks like OpenTelemetry.
Familiarity with ITIL frameworks and best practices in incident and problem management.
Background in enterprise environments, navigating corporate processes and bureaucracy.
Security experience in areas like RBAC, least privilege principles, and secure software development lifecycle processes.
Certifications in Kubernetes or relevant technologies.

Any Graduate
