* Design and implement highly automated systems/services that ensure the availability, reliability, and scalability of infrastructure and applications.
* Build and maintain monitoring and alerting to provide timely feedback on the performance and health of systems, network, and applications.
* Design and implement automation tools to reduce manual toil, streamline repetitive tasks, and enhance overall operational efficiency.
* Define and build Service Level Indicators (SLIs) and the reliability measures derived from them, including but not limited to Service Level Objectives (SLOs), error budgets, and burn-rate alerts.
* Work closely with development teams to embed reliability best practices into the software development process. Provide mentorship and training to cross-functional teams on SRE principles, encouraging a shared responsibility for the reliability of our services.
* Collaborate with our support, operations, and engineering teams to investigate and troubleshoot complex problems.
* Observe and monitor systems to maintain insight into their performance, health, availability, and internal state.
* Understand what to monitor based on the systems you manage, how the monitoring data is stored, and how to interpret that data to decide on future actions.
* Participate in continuous improvement efforts that span multiple cross-functional domains and inform the creation of new standards.
* Take part in an on-call rotation, continuously improve automation and documentation, and mentor others on standard infrastructure automation practices to encourage adoption.
* Resolve differences of opinion and drive team alignment around a specific goal or solution.
* Hold associates and teams accountable for adhering to practices and policies.
Any Graduate