Site Reliability Engineer

We are looking for a highly skilled Site Reliability Engineer (SRE) with strong experience in Node.js and Java to help us scale and maintain high-performance, resilient, and secure systems. You'll collaborate with software engineers, DevOps, and platform teams to improve observability, automate operations, and ensure system reliability in production environments.

Key Responsibilities:

Design, build, and maintain scalable and reliable infrastructure for microservices built in Node.js and Java.
Develop monitoring and alerting strategies (e.g., Prometheus, Grafana, ELK, Datadog) to improve system observability.
Implement and manage CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
Automate infrastructure provisioning using tools like Terraform, Ansible, or Helm.
Collaborate with development teams to improve deployment processes, release velocity, and system performance.
Drive incident management processes, root cause analysis, and post-mortems.
Build tools and services to manage infrastructure, reduce toil, and improve developer productivity.
Apply SRE principles: SLAs, SLOs, SLIs, capacity planning, and error budgets.
Optimize application performance and troubleshoot issues across the stack (network, OS, app, DB).

Required Qualifications:

3–6+ years of experience in site reliability engineering.
Proficient in Node.js and Java for backend service development and debugging.
Experience with cloud platforms: AWS / GCP / Azure.
Strong knowledge of Kubernetes, Docker, and container orchestration.
Experience with monitoring and logging tools: Prometheus, Grafana, ELK, Datadog, New Relic, etc.
Solid understanding of Linux systems, networking, and distributed systems.
Familiarity with infrastructure-as-code tools (e.g., Terraform, Pulumi, CloudFormation).

Any Graduate
