Description

  • Design, implement, and optimize observability solutions across metrics, logging, and tracing.
  • Build and maintain dashboards and alerts (e.g., Datadog) that provide meaningful insight into system health and performance.
  • Define and support adoption of Service Level Objectives (SLOs), Indicators (SLIs), and error budgets.

Incident & Problem Management

  • Participate in and lead incident response efforts during major outages and critical events.
  • Support on-call rotations, particularly during key business events (e.g., product launches, holiday traffic).
  • Conduct and contribute to Root Cause Analyses (RCAs) and post-incident reviews, driving follow-up actions and long-term remediation plans.
  • Collaborate with partner teams to enhance incident playbooks, reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and improve operational readiness.
  • Apply principles of the ITIL framework in areas such as incident, problem, and change management, ensuring alignment with organizational reliability goals.

Team Collaboration & Enablement

  • Partner with digital product teams to integrate observability best practices into their development and deployment workflows.
  • Identify tooling and knowledge gaps; champion improvements and automation initiatives that reduce toil and increase visibility.
  • Support product owners and engineering leads with prioritization between support, investment, and innovation work.
  • Mentor junior team members and advocate for team-wide knowledge sharing and continuous improvement.

Continuous Improvement & Strategic Contribution

  • Stay up to date with SRE and observability trends, helping to evaluate and adopt new tools and approaches.
  • Contribute to domain-level standards and practices within the broader technology organization.
  • Influence reliability strategy by sharing insights, performance metrics, and “what’s working/what’s not” feedback with senior engineers and technical leadership.

Qualifications

  • Bachelor’s degree in Computer Science or Engineering, or equivalent experience.
  • 8–12 years of experience in software engineering or SRE, with deep exposure to observability and monitoring.
  • Strong experience with observability tools such as Datadog, Splunk, and distributed tracing frameworks.
  • Proven track record in incident management, RCA facilitation, and on-call response, especially during critical peak-traffic events.
  • Understanding of ITIL concepts including Incident, Problem, and Change Management.
  • Experience building and maintaining dashboards, alerts, and SLOs/SLIs.
  • Strong debugging and root cause analysis skills across complex distributed systems.
  • Excellent collaboration, documentation, and communication skills.
  • Familiarity with infrastructure-as-code (e.g., Terraform), Kubernetes, and cloud-native systems.
  • Relevant certifications (e.g., Certified Kubernetes Administrator, Terraform Associate) are a plus.

Education

Bachelor's degree