We are seeking a skilled Senior Observability Engineer to design, implement, and optimize observability solutions across cloud and hybrid environments. The ideal candidate will have strong experience in cloud infrastructure (preferably OCI, though other platforms are acceptable), automation tools, observability stacks, and container orchestration. The role involves building scalable, resilient monitoring systems that ensure the performance, security, and availability of infrastructure and applications.
Key Responsibilities:
Architecture & Design
Design and implement end-to-end observability solutions leveraging tools like Grafana, Prometheus, Zabbix, Nagios, Loki, Elastic Stack, or OpenTelemetry.
Architect scalable and fault-tolerant infrastructure monitoring for OCI cloud environments.
Build robust observability stacks to enable application performance monitoring (APM), infrastructure metrics, and log aggregation.
Infrastructure as Code (IaC)
Use Terraform to automate and manage infrastructure deployments and monitoring configurations.
Collaborate with DevOps teams to maintain IaC standards and CI/CD workflows.
Prior experience with Ansible and Puppet is a plus.
Observability & Monitoring
Deploy and configure Prometheus for metrics collection and alerting.
Build custom dashboards and visualizations in Grafana to monitor system health and performance.
Set up Osquery for endpoint visibility and security monitoring.
Develop monitoring frameworks for Docker containers and Kubernetes clusters.
CI/CD Pipeline
Hands-on experience deploying infrastructure with Jenkins as the CI/CD tool.
Knowledge of other CI/CD environments is a plus.
Containerization & Orchestration
Develop basic to intermediate Docker configurations to containerize monitoring solutions.
Configure and optimize Kubernetes clusters for observability, logging, and monitoring.
Collaboration & Leadership
Work with cross-functional teams (DevOps, Cloud Engineering, Application Development) to align monitoring objectives.
Provide technical guidance to junior team members on best practices for monitoring and observability.
Partner with security teams to ensure compliance and security in monitoring solutions.
Key Qualifications:
Technical Skills:
Proficiency in cloud infrastructure (OCI preferred; AWS, GCP, or Azure also relevant).
Strong knowledge of Grafana, Prometheus, Zabbix, and Nagios.
Experience with Terraform and CI/CD pipelines.
Working knowledge of container platforms (Docker, Kubernetes).
Expertise in setting up observability stacks, including logging, metrics, and tracing.
Understanding of Linux/Windows system internals and basic networking concepts.
Expertise in scripting with languages such as Go, Python, or Perl.
Soft Skills:
Strong problem-solving and analytical skills.
Effective communication with technical and non-technical teams.
Team player with leadership abilities to drive monitoring best practices.
Preferred Experience:
8+ years in cloud infrastructure or monitoring roles.
Hands-on experience in implementing observability tools and stacks.
Proven track record of improving infrastructure reliability through monitoring automation.
Any Graduate