What You'll Do:
In this role, you will continually evaluate the performance of our Elasticsearch and Splunk clusters to spot developing problems, plan changes for upcoming high-load events, apply security fixes, test and perform incremental upgrades, and extend and improve our monitoring and alerting infrastructure.
You’ll also be involved in maintaining other parts of the data pipeline, which may include serverless or server-based systems for feeding data into the Elasticsearch pipeline.
We’re continually working to optimize our cost-versus-performance position, so testing new host types and configurations is an ongoing focus.
We do much of our work with declarative tools such as Puppet, and various scripting mechanisms (depending on the target environment). In general, we want to automate as much as possible and aim for a ‘build once/run everywhere’ system.
Some of our Elasticsearch clusters run in the public cloud, some in Kubernetes clusters, and some in private datacenters. This is an opportunity to work with a variety of infrastructure types and operations teams.
Build long-lasting, effective partnerships across the organization to foster collaboration between Product, Engineering and Operations teams.
Participate in an on-call rotation and be willing to jump on escalated issues as needed.
Expertise You'll Bring:
Bachelor's Degree in computer science, information technology, engineering, or related discipline required
Expertise in administration and management of Elasticsearch clusters. (Primary)
Expertise in administration and management of Splunk clusters. (Secondary)
Strong knowledge of provisioning and configuration management tools like Puppet, Ansible, Rundeck, etc.
Experience building automation and Infrastructure as Code using Terraform, Packer, or CloudFormation templates. (Plus)
Experience with monitoring and logging tools like Splunk, Prometheus, PagerDuty, etc.
Experience with scripting languages like Python, Bash, Go, Ruby, Perl, etc.
Experience with CI/CD tools like Jenkins, Pipelines, Artifactory, etc.
An inquisitive mind with the ability to learn where data lives in a large, disparate system and what that data means
The skills to do effective troubleshooting, following a problem wherever it may lead