We are looking for a Site Reliability Engineer (SRE) with deep expertise in designing and implementing full observability frameworks, including telemetry, instrumentation, distributed tracing, and metrics pipelines.
This role is engineering-heavy and suited to someone who thrives on building scalable observability platforms rather than simply reacting to monitoring alerts.
The ideal candidate also brings strong experience in AWS cloud architecture and a builder’s mindset to reliability engineering.
Key Responsibilities:
Architect and implement end-to-end observability platforms including logs, metrics, traces, and events across distributed systems.
Build and manage telemetry pipelines using open standards and open-source tools such as OpenTelemetry, Prometheus, and Grafana, alongside AWS-native tools (CloudWatch, X-Ray, etc.).
Embed observability-as-code into CI/CD pipelines and infrastructure provisioning tools.
Partner with application and platform teams to define and implement SLIs, SLOs, and error budgets as engineering primitives.
Develop and maintain custom instrumentation libraries to provide actionable insights across services.
Engineer reliable, self-service observability tooling to empower development teams.
Drive cloud-native observability patterns on AWS, optimizing for performance, scalability, and cost.
Actively participate in post-incident reviews to improve system design and observability strategy.
Collaborate with SRE, DevOps, and Platform teams to align reliability objectives with business goals.
Required Skills & Experience:
Proven track record of building observability solutions at scale (not just using tools).
Strong hands-on expertise with OpenTelemetry, Prometheus, Grafana, ELK, CloudWatch, X-Ray, etc.
Advanced knowledge of AWS cloud architecture and services.
Proficient in at least one modern programming language (e.g., Python, Go, Java).
Experience with IaC tools such as Terraform or CloudFormation.
Deep understanding of SLI/SLO/SLA concepts, service health indicators, and telemetry standards.
Familiarity with containerization and orchestration (Docker, Kubernetes).
Ability to build reusable components, SDKs, or libraries that enable observability at scale.
Preferred Qualifications:
AWS Certifications (DevOps Engineer, Solutions Architect, etc.).
Experience contributing to open-source observability tools.
Background in software engineering or platform reliability architecture.
Any Graduate