Key Skills: Azure, Kubernetes, Terraform, Bash Shell, AWS, GCP, Microservices, Python, Observability, Site Reliability Engineer, Data Governance, Data Privacy.
Roles & Responsibilities:
- Define and evolve the observability vision and roadmap for our cloud-native healthcare platform.
- Architect and implement standardized observability frameworks (metrics, logs, traces, events, profiling) across services.
- Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
- Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
- Evaluate, integrate, and optimize observability tools (e.g., Datadog, Prometheus, Grafana, Tempo, Loki, Elastic).
- Lead incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
- Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, HITRUST).
- Mentor engineers and promote a culture of observability-first development.
Experience Required:
- 12 - 20 years of experience leading SRE or observability initiatives in large-scale distributed systems.
- Strong background in cloud platforms including Azure, AWS, and GCP with deep understanding of microservices architecture.
- Hands-on expertise in Terraform, Kubernetes, and modern observability tools.
- Demonstrated success in building resilient systems and establishing SLOs, SLIs, and SLA practices.
- Experience with compliance-driven environments such as healthcare or finance.
- Strong scripting and automation skills using Python, Bash Shell, and related technologies.
- Exceptional communication and leadership skills, with experience mentoring engineering teams.
Education: B.Tech M.Tech (Dual), B.Tech