Description

Overall Purpose

The Senior Site Reliability Engineer plays a critical role in maintaining reliability, availability, and performance. This position requires a blend of software engineering, systems engineering, and operations skills to build and maintain scalable, fault-tolerant systems.

 

Essential Functions

· Build systems to measure and track SLOs and hold engineering teams accountable for meeting them

· Create and maintain relevant documentation codifying institutional knowledge on applications

· Respond to business-critical incidents and participate in recovery as part of our Incident Management Program

· Design and implement software and tools to improve performance (availability, scalability, and latency) while delivering end products to customers with the highest efficiency and meeting all security standards

· Build automation and tooling around application management, such as deployments, configuration changes and disaster recovery scenarios

· Evaluate capacity of the application on a continuous basis to provide stats to the Product/Business teams and provide an efficient path to scale the infrastructure for future needs

· Identify performance bottlenecks and work with other infrastructure teams to troubleshoot and resolve issues.

· Implement standards across multiple disciplines, systems, and practices to improve overall application delivery.

· Work directly with development teams to provide feedback and technical requirements throughout the software development lifecycle.

· Participate in a 24X7 on-call rotation and act as an escalation point for the application and other underlying systems.

· Support the company's commitment to protect the integrity and confidentiality of systems and data.

 

Minimum Qualifications

· Education or experience equivalent to a Bachelor’s degree in computer science or engineering

· Proficiency in at least one programming language, e.g., Python, Go, or Java

· Working understanding of computer networking concepts, including TCP/IP, UDP, Ethernet, load balancing, and application proxies

· Working understanding of Linux/Unix systems with demonstrated ability to effectively troubleshoot, identify, and resolve problems.

· Working knowledge of containerization technologies, including Kubernetes and Docker

· Proficiency in Infrastructure as Code, e.g., Terraform and Ansible

· Working knowledge of Git and CI/CD tools, e.g., Jenkins, Harness, GitLab CI

· Comfort with facilitating collaboration, open communication and reaching across functional borders

· Ability to work and multi-task in a fast-paced environment.

· Excellent oral and written communication and interpersonal skills.

· High level of customer responsiveness, excellent documentation and communication skills and attention to detail.

· Hands-on experience in supporting applications in a 24X7 customer-facing production environment

· Successful completion of a background check and drug screen.

 

Preferred Qualifications

· Experience with large-scale distributed systems with high uptime and performance requirements

· Working knowledge of RDBMS and NoSQL database systems

· Ability to write and debug SQL

· Working knowledge and understanding of cloud application design patterns and best practices

· Working understanding of security systems and applications, including firewalls, encryption, and X.509 public key infrastructure

Education

Any Graduate