Site Reliability Engineer

Build, deploy (safely and incrementally), and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
Monitor, support and enhance developer experience across services.
Build automation to remove toil and efficiently operate production systems.
Proactively monitor, respond to, and improve alerts, and set up automated alert handling.
Create and maintain incident response runbooks.
Triage platform and infrastructure issues, and help Arista software engineers with their own triage. Engage with third-party vendor support.
Write postmortem documents and build solutions to prevent incidents from recurring.
Plan and communicate maintenance windows on production systems.
Work with Arista’s product development teams to identify infrastructural issues that are causing bottlenecks and limitations in their workflows. Design and implement solutions to resolve them.
Survey and adopt best practices around infrastructure/platform to maintain secure, scalable and fault-tolerant systems.
Study the design of OSS systems, and enough of their implementation details, to triage and fix issues more effectively.

Qualifications

Essential Skills

At least BSc Computer Science or Engineering + 3 years’ experience, MS Computer Science or Engineering + 3 years’ experience, or equivalent work experience.
Knowledge of one or more of Go, Python, or shell scripting, sufficient to implement medium-complexity automation workflows.
Knowledge of Linux (or UNIX) from an administration and debugging perspective
Hands-on experience in operating software systems (infrastructure, complex applications, etc.) at scale
Experience in server provisioning (especially from a storage and networking perspective).
Strong problem solving and software troubleshooting skills
Experience with infrastructure-as-code

Desired Skills

Experience managing databases: MariaDB, Postgres, MongoDB, etc.
Experience with Docker and virtualization technologies: KVM, QEMU, Kata Containers, etc.
Experience managing a monitoring stack: Prometheus, Loki, Tempo, InfluxDB, Grafana, Thanos, etc.
Experience managing Elasticsearch clusters
Experience managing Artifactory, Docker registry, etc.
Experience managing CI/CD systems like ArgoCD, Spinnaker, etc.
Experience managing version control systems like Perforce, Gerrit, etc.
Experience with infrastructure-as-code frameworks like Ansible
Experience managing large Java applications
Experience in storage infrastructure management, e.g. NAS, SAN, Ceph, etc.

Any Graduate
