Description

Role Description

Run the production environment by monitoring availability and taking a holistic view of system health

Build software and systems to manage platform infrastructure and applications

Improve the reliability, quality, and time-to-market of our suite of software solutions

Measure and optimize system performance to extend our capabilities, anticipate customer needs, and drive continual improvement

Provide primary operational support and engineering for multiple large-scale distributed software applications

Responsibilities

Gather and analyze metrics from both operating systems and applications to support performance tuning and fault finding

Partner with development teams to improve services through rigorous testing and release procedures

Participate in system design consulting, platform management, and capacity planning

Create sustainable systems and services through automation and platform upgrades

Balance feature development speed and reliability with well-defined service-level objectives

Provide production support and configuration for monitoring tools such as Icinga, Graphite, Splunk, and Nagios

Configure and suppress alerts, and add or remove monitored endpoints

Required skills and qualifications

Bachelor's degree (or equivalent) in computer science or a related discipline

Ability to program (structured and object-oriented) in one or more high-level languages, such as Python, Java, Golang, or JavaScript

Proactive approach to identifying problems, performance bottlenecks, and areas for improvement

Preferred skills and qualifications

Proven track record in technical engineering roles

Coding experience beyond simple scripts

Education

Bachelor's degree