Description

  • We are looking for a highly skilled Site Reliability Engineer (SRE) with strong experience in Node.js and Java to help us scale and maintain high-performance, resilient, and secure systems. You'll collaborate with software engineers, DevOps, and platform teams to improve observability, automate operations, and ensure system reliability in production environments.

Key Responsibilities:

  • Design, build, and maintain scalable and reliable infrastructure for microservices built in Node.js and Java.
  • Develop monitoring and alerting strategies (e.g., Prometheus, Grafana, ELK, Datadog) to improve system observability.
  • Implement and manage CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
  • Automate infrastructure provisioning using tools like Terraform, Ansible, or Helm.
  • Collaborate with development teams to improve deployment processes, release velocity, and system performance.
  • Drive incident management processes, root cause analysis, and post-mortems.
  • Build tools and services to manage infrastructure, reduce toil, and improve developer productivity.
  • Apply SRE principles: SLAs, SLOs, SLIs, capacity planning, and error budgets.
  • Optimize application performance and troubleshoot issues across the stack (network, OS, app, DB).

Required Qualifications:

  • 3–6+ years of experience in an SRE role.
  • Proficiency in Node.js and Java for backend service development and debugging.
  • Experience with cloud platforms such as AWS, GCP, or Azure.
  • Strong knowledge of Kubernetes, Docker, and container orchestration.
  • Experience with monitoring and logging tools such as Prometheus, Grafana, ELK, Datadog, or New Relic.
  • Solid understanding of Linux systems, networking, and distributed systems.
  • Familiarity with infrastructure-as-code tools (e.g., Terraform, Pulumi, CloudFormation).

Education

Any Graduate