- Design, implement, and optimize observability solutions across metrics, logging, and tracing.
- Build and maintain dashboards and alerts (e.g., Datadog) that provide meaningful insight into system health and performance.
- Define and support adoption of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Incident & Problem Management
- Participate in and lead incident response efforts during major outages and critical events.
- Support on-call rotations, particularly during key business events (e.g., product launches, holiday traffic).
- Conduct and contribute to Root Cause Analyses (RCAs) and post-incident reviews, driving follow-up actions and long-term remediation plans.
- Collaborate with partner teams to enhance incident playbooks, reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and improve operational readiness.
- Apply principles of the ITIL framework in areas such as incident, problem, and change management, ensuring alignment with organizational reliability goals.
Team Collaboration & Enablement
- Partner with digital product teams to integrate observability best practices into their development and deployment workflows.
- Identify tooling and knowledge gaps; champion improvements and automation initiatives that reduce toil and increase visibility.
- Support product owners and engineering leads in prioritizing among support, investment, and innovation work.
- Mentor junior team members and advocate for team-wide knowledge sharing and continuous improvement.
Continuous Improvement & Strategic Contribution
- Stay up to date with SRE and observability trends, helping to evaluate and adopt new tools and approaches.
- Contribute to domain-level standards and practices within the broader technology organization.
- Influence reliability strategy by sharing insights, performance metrics, and “what’s working/what’s not” feedback with senior engineers and technical leadership.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent experience.
- 8–12 years of experience in software engineering or SRE, with deep exposure to observability and monitoring.
- Strong experience with observability tools such as Datadog, Splunk, and distributed tracing frameworks.
- Proven track record in incident management, RCA facilitation, and on-call response, especially during critical peak-traffic events.
- Understanding of ITIL concepts including Incident, Problem, and Change Management.
- Experience building and maintaining dashboards, alerts, and SLOs/SLIs.
- Strong debugging and root cause analysis skills across complex distributed systems.
- Excellent collaboration, documentation, and communication skills.
- Familiarity with infrastructure-as-code (e.g., Terraform), Kubernetes, and cloud-native systems.
- Relevant certifications (e.g., Certified Kubernetes Administrator, Terraform Associate) are a plus.