Key Skills: Cloud, Kubernetes, Python, Jenkins, OpenTelemetry, AppDynamics, Site Reliability Engineering (SRE).
Roles & Responsibilities:
- Design, implement, and manage cloud infrastructure to ensure high availability and reliability.
- Use Kubernetes to orchestrate and manage containerized workloads.
- Develop and maintain monitoring solutions using OpenTelemetry and AppDynamics.
- Automate deployment processes using Jenkins.
- Collaborate with cross-functional teams to troubleshoot and resolve issues in production environments.
- Continuously improve system performance and reliability through proactive monitoring and incident response.
- Participate in on-call rotations to ensure uptime and handle high-severity incidents.
- Establish SLOs, SLIs, and SLAs in line with business expectations.
- Implement automated recovery solutions and contribute to chaos engineering practices.
- Support CI/CD pipelines by integrating observability and automated validation checks.
- Document root cause analyses, incident retrospectives, and system architecture changes.
Experience Required:
- 5 to 8 years of experience managing large-scale cloud environments with Kubernetes and cloud-native tools.
- Strong scripting skills in Python for automation and tooling.
- Demonstrated ability to configure and optimize Jenkins pipelines for continuous integration and deployment.
- Hands-on experience implementing distributed tracing and observability using OpenTelemetry.
- Working knowledge of performance monitoring and diagnostics with AppDynamics.
- Experience handling production incidents, driving post-incident reviews, and implementing preventive measures.
- Familiarity with cloud cost optimization and resource scaling best practices.
Education: Bachelor's degree in any discipline