Description

Key Skills: Software Design, Rapid Prototyping, Agile Development, Technical Mentoring.

Roles and Responsibilities:

  • Maintain the availability, latency, performance, and efficiency of systems, and own change management, monitoring, emergency response, and capacity planning.
  • Identify and resolve system issues and failures to ensure continuous system reliability.
  • Ensure the reliability of infrastructure environments, preventing downtime or performance bottlenecks.
  • Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault identification.
  • Partner with development teams to enhance services through rigorous testing and release procedures.
  • Use automation tools to monitor software reliability in production environments.
  • Lead and drive platform-first initiatives to ensure scalability, reliability, and performance of the technology platform.
  • Enhance availability, reliability, and performance of critical systems and services.
  • Design and develop fully automated workflows using scripting languages such as JavaScript, PowerShell, and Bash.

Experience Requirement:

  • 5-8 years of experience with SRE and Observability concepts.
  • Strong knowledge of, or experience with, monitoring tools such as Prometheus, Grafana, ITRS, and AppDynamics.
  • Hands-on experience with DevOps tools such as Jenkins, TeamCity, Ansible, and uDeploy.
  • Strong experience in scripting and automation using Python and Shell scripting.

Education: B.E., B.Tech., B.Sc.

Education

Any Graduate