Description

Job Overview: We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our dynamic team. This role demands extensive experience in both frontend and backend development, along with a strong grasp of cloud technologies and database management. You will work closely with the Engineering team, Product team, and other stakeholders to design and implement scalable, secure, and high-performance solutions. As a technical leader, you will ensure adherence to best practices, provide mentorship, and drive cross-functional collaboration.

Responsibilities of Senior SRE:

● The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability, stability and performance of systems and services.

● They work with cross-functional teams to design, build and maintain systems and they troubleshoot issues when they arise. They bridge the gap between development and operations teams.

● They work closely with business teams to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems. They also monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.

● They deploy and manage monitoring tools to gain insights on system health and performance.

● They analyze performance, identify bottlenecks and implement solutions to improve a system’s scalability and latency.

● They develop scripts and implement tools and automation frameworks to reduce manual intervention in deployment, monitoring and scaling.

● They work with development teams for design and development of observability practices like logging, metrics, tracing, etc. They aim to diagnose and troubleshoot issues proactively.

● They create actionable alerts on monitoring systems to ensure rapid response for potential production incidents.

● They forecast resource needs and provision adequately for current and future demand.

● They design and execute “chaos experiments” to test a system’s resilience to failure.

● They own, define and implement the Disaster Recovery (DR) processes for systems. They also conduct planned and unplanned mock DR drills to test for response preparedness during production incidents.

● They ensure that security best practices are followed and implemented during design and operations of systems.

● They also own and maintain documentation of processes, playbooks, and systems.

● They publish KPI reports and other system health updates on a regular basis to the business.

Requirements:

○ Must-have - Bachelor's degree, preferably in CS or a related field, or equivalent experience

○ Must-have - 12+ years of overall IT experience

○ Must-have - 7+ years of proven work experience as a Senior Site Reliability Engineer or a similar position.

○ Must-have - 5+ years of AWS Cloud experience with an AWS certification (e.g. AWS Certified DevOps Engineer, SysOps or Security).

○ Must-have - AWS experience - 3+ years of experience using a broad range of AWS technologies (e.g. EC2, RDS, ELB, S3, VPC, CloudWatch & Monitoring Tools) to develop and maintain an AWS-based cloud solution, with an emphasis on best-practice cloud security.

○ Must-have - 2+ years of experience in CDN and/or Cache systems like Fastly, Akamai, CloudFront, etc.

○ Proven understanding of and strong experience with cloud deployments (AWS/Docker/Kubernetes)

○ Knowledge of IaC provisioning tools and scripting languages such as Terraform, Chef, Ansible, Shell, Groovy, Python, etc.

○ Experience with monitoring systems such as CloudWatch, New Relic, Datadog/Splunk, ELK stack.

○ Experience managing cloud network resources (AWS Preferred) such as CloudWatch, VPC, URL proxies, private link, DNS, ACLs, firewalls, and C2S access points.

○ Platform or Application Engineering and Operational Knowledge in any of the CI/CD tooling like GitHub Actions, Jenkins, etc.

○ Experience with other tooling such as JIRA, Bitbucket, Jenkins, Fortify, SonarQube, Nexus, Nexus IQ

○ Experience with configuration automation tools like Puppet/Ansible/Chef/Salt

○ Scripting Skills: Strong scripting (e.g. Bash & Python) and automation skills.

○ Operating Systems: Windows and Linux system administration.

○ Problem Solving: Ability to analyze and resolve complex infrastructure resource and application deployment issues

○ Strong attention to detail. Excellent verbal and written communication skills. Strong documentation skills. 

Good To Have

● Experience with Terraform/Ansible/Chef/Puppet

● Experience with GitHub Actions

● Experience with CloudFront, Fastly

Education

Any Graduate