Description

Ensure reliability of cloud-based distributed systems infrastructure & services built to seamlessly scale to tens of billions of events per day.
Responsible for the availability, performance, monitoring, emergency response, and capacity planning of the Traceable cloud services & infrastructure. 
Responsible for building and maintaining modern infrastructure for CI/CD and DevOps.
Responsible for debugging and solving production issues & escalations, working with the rest of the engineering team.
Collaborate with product engineering teams across time zones on the design and operation of systems and services.
Lead, mentor, and manage a team of Site Reliability Engineers to ensure optimal performance and career growth. Establish team goals and objectives aligned with the company’s strategic vision. Foster a culture of continuous improvement, collaboration, and innovation within the SRE team.

Qualifications

Bachelor’s or Master’s degree in Computer Science
10+ years of work experience in SRE & DevOps with a modern cloud-native tech stack and distributed systems at massive scale
Strong experience with cloud-native technologies (AWS/GCP, microservices, containers, Kubernetes, etc.) at scale
Strong experience in streaming systems like Kafka Streams or Flink
Hands-on experience in setting up, automating and continuously improving the deployment pipelines and CI/CD infrastructure
Strong experience with Linux systems
Strong experience operationalizing and scaling modern data systems like MongoDB, Apache Pinot, Trino, Spark, Apache Iceberg, and Kafka Streams
Strong experience in infrastructure deployment and provisioning as code using modern tools (Terraform, Helm, Ansible, etc.)
Good expertise in Java and scripting
Strong troubleshooting & debugging skills for production issues & escalations
Experience working in a distributed team with different time zones
A self-starter with the ability to work effectively in teams and in a fast-paced startup environment
Excellent spoken and written communication skills

Education

Bachelor's degree in Computer Science