· Lead and manage the root cause analysis (RCA) process for all SRE incidents, ensuring thorough and timely investigations.
· Facilitate RCA workshops, guiding teams through a structured analysis to identify the root cause of incidents.
· Document RCA findings and recommendations clearly and concisely.
· Work with SRE engineers and developers to implement corrective actions and preventative measures based on RCA findings.
· Analyze trends in incident data to identify areas for improvement in system design, monitoring, and automation.
· Develop and implement best practices for RCA within the SRE organization.
· Stay up to date on the latest SRE practices and incident response methodologies.
· Collaborate with other teams (e.g., security, product) to ensure a holistic approach to incident management.
· Mentor and coach SRE engineers on effective RCA techniques.
· Track and report on key metrics related to incident management and RCA effectiveness.
Education: Any Graduate