Overall Purpose
The Senior Site Reliability Engineer plays a critical role in maintaining the reliability, availability, and performance of our applications and underlying systems. This position requires a blend of software engineering, systems engineering, and operations skills to build and maintain scalable, fault-tolerant systems.
Essential Functions
· Build systems to measure and track SLOs and hold engineering teams accountable for meeting them
· Create and maintain relevant documentation codifying institutional knowledge of applications
· Respond to business-critical incidents and participate in their recovery as part of our Incident Management Program
· Design and implement software and tools to improve performance (availability, scalability, and latency) while delivering end products to customers with the highest efficiency and meeting all security standards
· Build automation and tooling around application management, such as deployments, configuration changes and disaster recovery scenarios
· Evaluate capacity of the application on a continuous basis to provide stats to the Product/Business teams and provide an efficient path to scale the infrastructure for future needs
· Identify performance bottlenecks and work with other infrastructure teams to troubleshoot and resolve issues.
· Implement standards across multiple disciplines, systems and practices to improve the overall application delivery.
· Work directly with development teams to provide feedback and technical requirements to the software development lifecycle.
· Participate in a 24X7 on-call rotation and act as an escalation point for the application and other underlying systems.
· Support the company's commitment to protect the integrity and confidentiality of systems and data.
Minimum Qualifications
· Education or experience equivalent to a Bachelor’s degree in computer science or engineering
· Proficiency in at least one programming language, e.g., Python, Go, Java
· Working understanding of computer networking concepts including TCP/IP, UDP, Ethernet, load balancing, application proxies, etc.
· Working understanding of Linux/Unix systems with demonstrated ability to effectively troubleshoot, identify, and resolve problems
· Working knowledge of containerization technologies including Kubernetes and Docker
· Proficiency in Infrastructure as Code, e.g., Terraform and Ansible
· Working knowledge of Git and CI/CD tools, e.g., Jenkins, Harness, GitLab CI
· Comfort with facilitating collaboration, open communication and reaching across functional borders
· Ability to work and multi-task in a fast-paced environment.
· Excellent oral and written communication and interpersonal skills.
· High level of customer responsiveness, excellent documentation and communication skills and attention to detail.
· Hands-on experience in supporting applications in a 24X7 customer-facing production environment
· Successful completion of a background check and drug screen.
Preferred Qualifications
· Experience with large-scale distributed systems with high uptime and performance requirements
· Working knowledge of RDBMS and NoSQL database systems
· Ability to write and debug SQL
· Working knowledge and understanding of Cloud application design patterns and best practices
· Working understanding of security systems and applications including firewalls, encryption, and X.509 public key infrastructure