Description

  • Work with a variety of tools and technologies to ensure the reliability and performance of the Managed Services organization.
  • Drive initiatives to optimize and secure infrastructure operations while contributing to scalability and continuous improvement efforts.
  • Maintain and ensure operational excellence of systems and applications supporting Managed Services infrastructure.
  • Troubleshoot and resolve issues in a fast-paced, distributed environment, focusing on root cause analysis and post-incident reviews.
  • Build and maintain observability frameworks using tools like Prometheus, OpenTelemetry, and Dynatrace to ensure reliable monitoring of systems and applications.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and manage Error Budgets to balance reliability and innovation.
  • Use automation tools like Ansible and scripting languages like Python to reduce operational toil and optimize processes.
  • Manage and troubleshoot Kubernetes clusters and containerized environments, ensuring smooth service-to-service communication.
  • Diagnose and resolve networking issues across OSI layers 1-3, including packet capture analysis.
  • Oversee PKI certificate management and utilize tools like HashiCorp Vault for secrets management.
  • Collaborate with cross-functional teams to identify system improvements and resolve critical issues.
  • Provide mentoring and guidance to junior engineers, ensuring knowledge sharing and professional growth.

Qualifications:
Core Technical Skills:

  • Proficiency in Linux administration (system tuning, SSH, log analysis with grep and regular expressions).
  • Strong understanding of networking protocols and troubleshooting (OSI layers 1-3).
  • Hands-on experience with Kubernetes and container orchestration.
  • Automation experience with Ansible and scripting proficiency in Python.
  • Knowledge of PKI certificate management and HashiCorp Vault or similar tools for secrets management.
  • Expertise in monitoring and observability tools like Prometheus, Grafana, or Dynatrace.

Reliability Engineering Skills:

  • Experience defining and managing SLIs, SLOs, and Error Budgets.
  • Proven ability to instrument and analyze system performance metrics.
  • Deep familiarity with troubleshooting distributed systems and microservices architecture.

Soft Skills:

  • Strong initiative and curiosity for deep-diving into complex issues.
  • Excellent communication skills to convey solutions and collaborate across teams.
  • Ability to prioritize and resolve competing demands in high-pressure situations.

Preferred:

  • Experience with observability frameworks like OpenTelemetry.
  • Familiarity with ITIL frameworks and best practices in incident and problem management.
  • Background in enterprise environments, navigating corporate processes and bureaucracy.
  • Security experience in areas like RBAC, least privilege principles, and secure software development lifecycle processes.
  • Certifications in Kubernetes or relevant technologies.

Education

Any Graduate