o Design and implement strategies to ensure high availability, reliability, and performance of systems and services.
o Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
· Incident Management & Troubleshooting
o Respond to system outages and incidents, lead post-mortem investigations, and implement preventive measures.
o Create runbooks and automate recovery processes to reduce manual intervention.
o Participate in the on-call rotation and serve as an escalation contact for incidents.
· Infrastructure as Code (IaC)
o Build and maintain infrastructure using tools like Terraform.
o Ensure infrastructure is reproducible, version-controlled, and auditable.
· Monitoring & Observability
o Implement and manage monitoring tools (preferably Splunk).
o Set up alerts and dashboards to track the health and performance of services.
· Automation & Tooling
o Automate operational tasks such as deployments, scaling, backups, and failovers.
o Develop internal tools to support deployment pipelines and team workflows.
· Collaboration with Development & Operations
o Work closely with developers to design systems that are scalable and supportable.
o Advocate for and implement best practices around CI/CD, testing, and release management.
Required Skillset
· Programming & Scripting
o Proficiency in languages like Python, Bash, or Ruby.
o Ability to build tools, automate tasks, and debug production issues.
· Cloud Platforms
o Strong experience with cloud providers (GCP, Azure).
o Knowledge of cloud-native services, networking, and security.
· Linux/Unix & Windows Systems
o Deep understanding of system internals, performance tuning, and debugging.
· Containers & Orchestration
o Experience with Docker and Kubernetes (or other orchestration platforms).
· CI/CD & Automation Tools
o Familiarity with Jenkins, GitHub Actions, Argo CD, or similar.
o Experience setting up and managing deployment pipelines.
· Monitoring & Logging
o Knowledge of observability stacks covering metrics, logs, and traces.
· Security & Compliance Awareness
o Understanding of securing systems and managing access control, secrets, and audit logging.
Any Graduate