Description

Key Skills: Cloud, Kubernetes, Python, Jenkins, OpenTelemetry, AppDynamics, Site Reliability Engineering.

Roles & Responsibilities:

  • Design, implement, and manage cloud infrastructure to ensure high availability and reliability.
  • Use Kubernetes for container orchestration and management.
  • Develop and maintain monitoring solutions using OpenTelemetry and AppDynamics.
  • Automate deployment processes using Jenkins.
  • Collaborate with cross-functional teams to troubleshoot and resolve issues in production environments.
  • Continuously improve system performance and reliability through proactive monitoring and incident response.
  • Participate in on-call rotations to ensure uptime and handle high-severity incidents.
  • Establish SLOs, SLIs, and SLAs in line with business expectations.
  • Implement automated recovery solutions and contribute to chaos engineering practices.
  • Support CI/CD pipelines by integrating observability and automated validation checks.
  • Document root cause analyses, incident retrospectives, and system architecture changes.

Experience Required:

  • 5 to 8 years of experience managing large-scale cloud environments with Kubernetes and cloud-native tools.
  • Strong scripting skills in Python for automation and tooling.
  • Demonstrated ability to configure and optimize Jenkins pipelines for continuous integration and deployment.
  • Hands-on experience implementing distributed tracing and observability using OpenTelemetry.
  • Working knowledge of performance monitoring and diagnostics with AppDynamics.
  • Experience handling production incidents, driving post-incident reviews, and implementing preventive measures.
  • Familiarity with cloud cost optimization and resource scaling best practices.

Education: Graduate in any discipline
