We are looking for an experienced Site Reliability Engineering (SRE) Lead to manage and improve the reliability, performance, and scalability of our systems. You will lead a small team, work closely with developers and operations staff, and keep production environments running smoothly.
Key Responsibilities:
- Lead and guide the SRE team.
- Monitor, maintain, and improve system reliability and uptime.
- Automate operational processes wherever possible.
- Troubleshoot production issues and perform root cause analysis.
- Collaborate with development and DevOps teams on deployments and upgrades.
- Create and maintain documentation.
Required Skills:
- Proven experience as an SRE or in a similar role.
- Strong skills in cloud platforms (AWS, Azure, or GCP).
- Proficiency in automation tools (Terraform, Ansible, etc.).
- Knowledge of CI/CD pipelines.
- Strong scripting skills (Python, Bash, etc.).
- Good understanding of monitoring tools (Prometheus, Grafana, Datadog, etc.).
- Excellent problem-solving and communication skills.