Key Skills: Software Design, Rapid Prototyping, Agile Development, Technical Mentoring.
Roles and Responsibilities:
- Own the availability, latency, performance, and efficiency of systems, along with their change management, monitoring, emergency response, and capacity planning.
- Identify and resolve system issues and failures to ensure continuous system reliability.
- Ensure the reliability of infrastructure environments, preventing downtime or performance bottlenecks.
- Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault identification.
- Partner with development teams to enhance services through rigorous testing and release procedures.
- Use automation tools to monitor software reliability in production environments.
- Lead and drive platform-first initiatives to ensure scalability, reliability, and performance of the technology platform.
- Enhance availability, reliability, and performance of critical systems and services.
- Design and develop fully automated workflows using scripting languages such as JavaScript, PowerShell, and Bash.
Experience Requirement:
- 5-8 years of experience with SRE and Observability concepts.
- Strong knowledge of, or experience with, monitoring tools such as Prometheus, Grafana, ITRS, and AppDynamics.
- Hands-on experience with DevOps tools such as Jenkins, TeamCity, Ansible, and uDeploy.
- Strong experience in scripting and automation using Python and Shell scripting.
Education: B.E., B.Tech, B.Sc.