Description

The Senior Site Reliability Engineer (SRE) is responsible for designing, building, and maintaining scalable, reliable infrastructure and CI/CD systems. This role blends software engineering with systems engineering to ensure high availability, performance, and developer productivity. You will lead incident response, automation, and observability efforts while mentoring junior engineers and collaborating across teams.

 

Key Responsibilities:

·      Design and maintain scalable infrastructure on cloud platforms (AWS, GCP, Azure).

·      Build and manage CI/CD pipelines and DevOps tooling (GitHub, Terraform, YAML-based configuration).

·      Lead incident response, root cause analysis, and post-mortem processes.

·      Develop observability tools (monitoring, logging, alerting).

·      Automate infrastructure and operational tasks to reduce manual toil.

·      Collaborate with software teams to improve deployment and reliability.

·      Advocate for containerization and orchestration (Docker, Kubernetes).

·      Implement and maintain Infrastructure as Code (IaC) practices.

·      Drive initiatives such as chaos engineering, load testing, and auto-scaling.

·      Contribute to capacity planning, cost optimization, and SLOs.

·      Provide Level-3 support for CI/CD systems and participate in on-call rotations.

 

Qualifications:

·      8+ years of experience in SRE, DevOps, or infrastructure engineering.

·      Strong experience with CI/CD, Terraform, and cloud platforms.

·      Proficiency in scripting (Python, Bash, Go).

·      Deep knowledge of Linux systems and networking.

·      Experience with observability tools (New Relic, Splunk, Datadog).

·      Familiarity with microservices and distributed systems.

Education

Any Graduate