Description

Role & Responsibilities

We are looking for a dedicated Site Reliability Engineer (SRE) - Cloud Ops to join our team. In this role, you will play a key part in ensuring the stability and scalability of our cloud infrastructure. You will be responsible for monitoring, troubleshooting, and resolving infrastructure and application alerts, managing pipelines, and addressing environment-related issues in a dynamic 24/7 operational environment.

Key Responsibilities:

  • Infrastructure Monitoring and Alert Response: Proactively monitor infrastructure and application alerts, ensuring prompt resolution to maintain uptime and performance.
  • Shift-Based Operations: Work in a 24/7 environment with flexible availability for rotational shifts.
  • Cloud Environment Management: Manage and resolve environment-related issues, focusing on stability and efficiency.
  • Pipeline Management: Oversee CI/CD pipelines and ensure smooth deployment of updates and releases.
  • Operational Tasks: Execute day-to-day operational activities, including incident management, change management, and maintaining operational excellence.
  • Tool Management: Use tools such as Kubernetes, PagerDuty, and Google Cloud Platform (GCP) to support operational activities.

Ideal Candidate

  • B.E./B.Tech. graduate with 1+ years of experience in Site Reliability or Cloud Ops.
  • Monitoring and Alerting Expertise: In-depth knowledge of monitoring tools (Prometheus, Grafana, ELK) and alerting systems, with the ability to resolve related issues promptly.
  • Kubernetes: Hands-on experience with Kubernetes for orchestration and container management.
  • PagerDuty: Proficiency in setting up and managing alerting systems.
  • Cloud Fundamentals: Basic understanding of GCP (Google Cloud Platform) services and operations.
  • Incident Management: Strong problem-solving skills and experience in handling critical incidents under pressure.
  • DevOps Processes: Basic knowledge of CI/CD pipelines, automation, and infrastructure-as-code practices.

Education

Any Graduate