Cloud Site Reliability Engineer

Key Skills: Cloud, Kubernetes, Python, Jenkins, OpenTelemetry, AppDynamics, Site Reliability Engineering.

Roles & Responsibilities:

Design, implement, and manage cloud infrastructure to ensure high availability and reliability.
Utilize Kubernetes for container orchestration and management.
Develop and maintain monitoring solutions using OpenTelemetry and AppDynamics.
Automate deployment processes using Jenkins.
Collaborate with cross-functional teams to troubleshoot and resolve issues in production environments.
Continuously improve system performance and reliability through proactive monitoring and incident response.
Participate in on-call rotations to ensure uptime and handle high-severity incidents.
Establish SLOs, SLIs, and SLAs in line with business expectations.
Implement automated recovery solutions and contribute to chaos engineering practices.
Support CI/CD pipelines by integrating observability and automated validation checks.
Document root cause analyses, incident retrospectives, and system architecture changes.

Experience Required:

5 to 8 years of experience managing large-scale cloud environments with Kubernetes and cloud-native tools.
Strong scripting skills in Python for automation and tooling.
Demonstrated ability to configure and optimize Jenkins pipelines for continuous integration and deployment.
Hands-on experience implementing distributed tracing and observability using OpenTelemetry.
Working knowledge of performance monitoring and diagnostics with AppDynamics.
Experience handling production incidents, driving post-incident reviews, and implementing preventive measures.
Familiarity with cloud cost optimization and resource scaling best practices.

Education: Graduate in any discipline
