Description

We are looking for an experienced Site Reliability Engineering (SRE) Lead to manage and improve the reliability, performance, and scalability of our systems. You will lead a small team, work with developers and operations, and keep production environments running smoothly.

Key Responsibilities:

  • Lead and guide the SRE team.
  • Monitor, maintain, and improve system reliability and uptime.
  • Automate operational processes wherever possible.
  • Troubleshoot production issues and perform root cause analysis.
  • Collaborate with development and DevOps teams on deployments and upgrades.
  • Create and maintain documentation.

Required Skills:

  • Proven experience as an SRE or in a similar role.
  • Strong skills in cloud platforms (AWS, Azure, or GCP).
  • Proficiency in automation tools (Terraform, Ansible, etc.).
  • Knowledge of CI/CD pipelines.
  • Strong scripting skills (Python, Bash, etc.).
  • Good understanding of monitoring tools (Prometheus, Grafana, Datadog, etc.).
  • Excellent problem-solving and communication skills.

Education

Any Graduate