Description

Responsibilities:

  • Configure and tune the observability platforms we use to streamline alerting and proactive issue identification.
  • Automate manual activities as new features are added to the platform.
  • Ensure the platform functional test suite remains complete and runs quickly.
  • Expand performance and load testing capability to better simulate real production use.
  • Perform root cause analysis and own the actionable follow-ups.
  • Measure and optimize system performance, setting appropriate metrics and SLAs/SLOs.
  • Manage risk and control activities across the platform and team.

Essential Skills:

  • 3+ years in an SRE and operational role running a large enterprise platform
  • 5+ years working with Linux
  • 5+ years scripting in Python/Shell/Bash/Ksh.
  • 3+ years of software development experience (Java on Linux)
  • Extensive experience working with and customizing automation scripts – Ansible is currently used.
  • Prior experience with DevOps CI/CD tools such as Git and Jenkins.
  • Experience analyzing data to drive decisions.
  • Competence in API, web service, and microservice development
  • Strong communication skills, both written and oral
  • Strong architecture and design skills
  • Strong analytical, algorithmic, and problem-solving skills
  • Excellent teamwork skills and a proactive attitude
  • BS/MS degree in Computer Science or related technical field

Desired Skills:

  • Experience with Docker, Kubernetes, OpenShift
  • Experience with databases like Oracle, MongoDB
  • Java Spring Framework development experience
  • Experience with configuration management tooling, e.g. Ansible, Chef, Puppet, or SaltStack
  • Experience with Splunk, Grafana, Prometheus
  • Experience writing automated tests
  • Ability to quickly learn new concepts and software

Education

BS/MS degree in Computer Science