Description

We are seeking an experienced Observability Engineer with a strong DevOps background to design, implement, and manage observability solutions across cloud and on-prem environments. The ideal candidate will have expertise in monitoring, logging, tracing, and alerting to ensure high system availability, performance, and reliability.
 

Key Responsibilities:
 

  • Design & Implement Observability Solutions: Develop and maintain monitoring, logging, and tracing solutions using industry-leading tools (Prometheus, Grafana, Datadog, New Relic, Splunk, etc.).
     
  • Performance Monitoring & Optimization: Proactively identify and resolve performance bottlenecks in distributed systems.
     
  • Logging & Tracing: Set up and manage centralized logging and tracing solutions (ELK/EFK stack, Fluentd, OpenTelemetry).
     
  • Alerting & Incident Management: Configure alerting mechanisms using tools like PagerDuty, Opsgenie, or VictorOps for proactive issue detection.
     
  • SRE Practices: Implement Site Reliability Engineering (SRE) principles to enhance system reliability and reduce MTTR (Mean Time to Resolution).
     
  • Automation & Infrastructure as Code (IaC): Automate observability setup and configurations using Terraform, Ansible, or similar tools.
     
  • Cloud & Kubernetes Monitoring: Implement observability best practices for cloud platforms (AWS, Azure, GCP) and containerized environments (Kubernetes, Docker).
     
  • Collaboration: Work closely with development, SRE, and operations teams to ensure end-to-end observability of applications and services.
     
  • Compliance & Security: Ensure logging and monitoring solutions adhere to security and compliance requirements.
     
Requirements

Required Skills & Qualifications:
 

  • 6-10 years of experience in DevOps, SRE, or Observability engineering.
     
  • Strong hands-on experience with observability tools like Prometheus, Grafana, New Relic, Datadog, Splunk, ELK/EFK, OpenTelemetry, AppDynamics, etc.
     
  • Experience in setting up distributed tracing solutions (Jaeger, Zipkin, OpenTelemetry).
     
  • Expertise in Kubernetes monitoring using Prometheus, Thanos, Loki, or similar tools.
     
  • Strong proficiency in scripting (Python, Bash or other shell languages) for automation.
     
  • Hands-on experience with Terraform, Ansible, Helm, or CloudFormation for infrastructure automation.
     
  • Proficiency in CI/CD pipelines and GitOps methodologies using Jenkins, GitLab CI, Argo CD, or Flux.
     
  • Experience in public cloud environments (AWS, Azure, GCP) and monitoring cloud-native services.
     
  • Strong troubleshooting and root cause analysis (RCA) skills.
     
  • Understanding of SLIs, SLOs, and error budgets as part of SRE best practices.
     
  • Familiarity with log management, anomaly detection, and AI-based observability solutions is a plus.

Education

Any Graduate