Description

o Design and implement strategies to ensure high availability, reliability, and performance of systems and services.

o Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

· Incident Management & Troubleshooting

o Respond to system outages and incidents, lead post-mortem investigations, and implement preventive measures.

o Create runbooks and automate recovery processes to reduce manual intervention.

o Participate in the on-call rotation and serve as an escalation contact for incidents.

· Infrastructure as Code (IaC)

o Build and maintain infrastructure using tools like Terraform.

o Ensure infrastructure is reproducible, version-controlled, and auditable.

· Monitoring & Observability

o Implement and manage monitoring tools (preferably Splunk).

o Set up alerts and dashboards to track the health and performance of services.

· Automation & Tooling

o Automate operational tasks such as deployments, scaling, backups, and failovers.

o Develop internal tools to support deployment pipelines and team workflows.

· Collaboration with Development & Operations

o Work closely with developers to design systems that are scalable and supportable.

o Advocate for and implement best practices around CI/CD, testing, and release management.

 

Required Skillset

· Programming & Scripting

o Proficiency in languages such as Python, Bash, or Ruby.

o Ability to build tools, automate tasks, and debug production issues.

· Cloud Platforms

o Strong experience with cloud providers (GCP, Azure).

o Knowledge of cloud-native services, networking, and security.

· Linux/Unix & Windows Systems

o Deep understanding of system internals, performance tuning, and debugging.

· Containers & Orchestration

o Experience with Docker and Kubernetes (or other orchestration platforms).

· CI/CD & Automation Tools

o Familiarity with Jenkins, GitHub Actions, ArgoCD, or similar.

o Experience setting up and managing deployment pipelines.

· Monitoring & Logging

o Knowledge of observability stacks covering metrics, logs, and traces (for example, Splunk).

· Security & Compliance Awareness

o Understanding of how to secure systems and manage access control, secrets, and audit logging.

 

Education

Any Graduate