Description

Key Responsibilities:
● Develop a scalable, multi-tenant observability platform using open-source technologies, covering centralized logging, monitoring, alerting, APM, system status pages, and dashboards.
● Partner with service owners and key stakeholders to define SLAs, SLOs, and SLIs for business-critical systems.
● Contribute to the architecture, deployment, and operation of enterprise-class, secure, instrumented, and highly available infrastructure and services.
● Create or integrate automation frameworks and infrastructure tools using technologies like Terraform/OpenTofu, SaltStack/Ansible, Docker, Kubernetes/Helm, Bash, and Python.
● Routinely analyze and address system inefficiencies and performance bottlenecks to ensure infrastructure scalability.
● Advise internal and external partners on optimal approaches to specific technical tasks.

What You Should Have:
● Over 5 years of hands-on experience in a relevant technical role
● Deep familiarity with configuration and provisioning management platforms
● Extensive background working with open-source alerting and monitoring tools
● Proven experience deploying multi-tenant Zabbix alerting platforms
● Strong programming capabilities in one or more languages, such as Python
● Solid hands-on knowledge of Kubernetes and container-based architectures
● Demonstrated ability to build multi-tenant observability solutions using open-source technology

Preferred Qualifications:
● Exposure to artifact management tools (e.g., Artifactory, Nexus)
● Hands-on experience managing Gerrit
● Practical knowledge of both on-premises and public cloud environments (AWS, Azure, or GCP), including architecture design, platform constraints, and cost-saving methods
● Background in leading and executing large-scale technical redesign initiatives

Education

Any Graduate