Description

The Gardener team is adopting the Kubernetes open-source system for automating the deployment, scaling, and management of containerized SAP solutions and business applications. The team's key deliverable is evolving and enhancing the open-source project Gardener as SAP's way to provide homogeneous and universal Kubernetes-based container management, in close collaboration with teams at other SAP locations. Project Gardener provides solutions to run and orchestrate Kubernetes clusters on public cloud, hybrid, or SAP-owned infrastructure for a variety of enterprise use cases.

The Site Reliability Engineering team in the Gardener organization ensures a Live Site First culture: it reacts to alerts, performs proactive monitoring, prepares and applies hotfixes or critical configuration changes, performs root cause analysis, and implements automation and improvements in various components around the test machinery, monitoring, and observability stacks.

Responsibilities:

  • Develop, maintain, and enhance software-based solutions to achieve improvements in service stability, reliability, and operations
  • Act as a technical expert during incidents, investigating and resolving them at a deep technical level. Perform troubleshooting and log analysis to identify and solve issues in accordance with internal and external SLAs
  • Drive root cause analysis and follow-up improvements to prevent recurring issues
  • Learn new technologies and keep up to date with the latest product releases
  • Work with tools like Concourse, GitHub & GitHub Actions, Grafana, Prometheus, Prow
  • Use programming languages like Go, Python and Bash
  • A solid understanding of Kubernetes is a must. Certified Kubernetes Administrator (CKA) certification is a plus; otherwise, obtaining it is expected within a year
  • Voluntary weekend on-call duties
  • Experience with Istio, Linux and security hardening procedures is a plus

Education

Bachelor's or Master's degree