Role & Responsibilities
We are looking for a dedicated Site Reliability Engineer (SRE) - Cloud Ops to join our team. In this role, you will play a key part in ensuring the stability and scalability of our cloud infrastructure. You will be responsible for monitoring, troubleshooting, and resolving infrastructure and application alerts, managing pipelines, and addressing environment-related issues in a dynamic 24/7 operational environment.
Key Responsibilities:
- Infrastructure Monitoring and Alert Response: Proactively monitor infrastructure and application alerts, ensuring prompt resolution to maintain uptime and performance.
- Shift-Based Operations: Work in a 24/7 environment with flexible availability for rotational shifts.
- Cloud Environment Management: Manage and resolve environment-related issues, focusing on stability and efficiency.
- Pipeline Management: Oversee CI/CD pipelines and ensure smooth deployment of updates and releases.
- Operational Tasks: Execute day-to-day operational activities, including incident management, change management, and maintaining operational excellence.
- Tool Management: Use tools such as Kubernetes, PagerDuty, and Google Cloud Platform (GCP) to support operational activities.
Ideal Candidate
- B.E./B.Tech. graduate with 1+ years of experience in Site Reliability or Cloud Ops.
- Monitoring and Alerting Expertise: In-depth knowledge of monitoring tools (Prometheus, Grafana, ELK) and alerting systems, with the ability to resolve related issues promptly.
- Kubernetes: Hands-on experience with Kubernetes for orchestration and container management.
- PagerDuty: Proficiency in setting up and managing alerting in PagerDuty.
- Cloud Fundamentals: Basic understanding of GCP (Google Cloud Platform) services and operations.
- Incident Management: Strong problem-solving skills and experience in handling critical incidents under pressure.
- DevOps Processes: Basic knowledge of CI/CD pipelines, automation, and infrastructure-as-code practices.