Description

Key Skills: Azure, Kubernetes, Terraform, Bash Shell, AWS, GCP, Microservices, Python, Observability, Site Reliability Engineer, Data Governance, Data Privacy.

Roles & Responsibilities:

  • Define and evolve the observability vision and roadmap for our cloud-native healthcare platform.
  • Architect and implement standardized observability frameworks (metrics, logs, traces, events, profiling) across services.
  • Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
  • Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
  • Evaluate, integrate, and optimize observability tools (e.g., Datadog, Prometheus, Grafana, Tempo, Loki, Elastic).
  • Lead incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
  • Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, HITRUST).
  • Mentor engineers and promote a culture of observability-first development.

Experience Required:

  • 12 - 20 years of experience leading SRE or observability initiatives in large-scale distributed systems.
  • Strong background in cloud platforms including Azure, AWS, and GCP with deep understanding of microservices architecture.
  • Hands-on expertise in Terraform, Kubernetes, and modern observability tools.
  • Demonstrated success in building resilient systems and establishing SLOs, SLIs, and SLA practices.
  • Experience with compliance-driven environments such as healthcare or finance.
  • Strong scripting and automation skills using Python, Bash Shell, and related technologies.
  • Exceptional communication and leadership skills, with experience mentoring engineering teams.

Education: B.Tech M.Tech (Dual), B.Tech

Education

Any Graduate