We are seeking a skilled and motivated DevOps Site Reliability Engineer (SRE) to join our team. As a DevOps SRE, you will play a crucial role in the design, development, and maintenance of our infrastructure, ensuring seamless and efficient software delivery. Your primary responsibilities will include automating tasks, managing continuous integration and continuous deployment (CI/CD) pipelines, and maintaining system health. You will also be responsible for incident response, post-incident analysis, and implementing preventive measures to enhance system reliability.
Years of experience needed:
- 4-6 years of experience in DevOps as a Site Reliability Engineer.
- Telecommunications billing experience is preferred.
Technical Skills:
- Proficiency in scripting languages, especially Bash and Python, for automation.
- Hands-on experience with AWS's managed Kubernetes service, Amazon Elastic Kubernetes Service (EKS), and with the Helm package manager.
- Experience working with Splunk and OpenTelemetry.
- Must have worked with Amazon ElastiCache for Redis clusters and with Amazon RDS (Relational Database Service).
- Experience with CI/CD tools like Jenkins and GitLab CI/CD, and strong pipeline management skills.
- Familiarity with version control systems, particularly Git, and collaboration platforms.
- Knowledge of infrastructure-as-code (IaC) principles and tools, such as Ansible and Terraform.
- Strong problem-solving skills, a proactive approach to system health, and the ability to troubleshoot complex issues.
- A solid background in system administration, infrastructure management, or software engineering.
- Experience in incident response, post-incident analysis, and implementing preventive measures.
- Familiarity with observability tools, monitoring, and alerting systems.
- A commitment to balancing reliability concerns with continuous innovation and development.
Certifications Needed:
Key Responsibilities:
- Use scripting languages such as Bash and Python to automate tasks and streamline operational processes.
- Implement and manage CI/CD tools like Jenkins and GitLab CI/CD to enable automated software deployment and integration.
- Use AWS's managed Kubernetes service, Amazon Elastic Kubernetes Service (EKS), to achieve robust, scalable, and reliable containerized application deployments.
- Use a Kubernetes package manager such as Helm to define, install, and upgrade complex applications and their dependencies as "charts."
- Use Splunk and OpenTelemetry to achieve comprehensive observability for monitoring, troubleshooting, and optimizing modern distributed systems, and to improve their reliability.
- Automate the provisioning, configuration, scaling, and patching of Amazon ElastiCache for Redis clusters and Amazon RDS (Relational Database Service) instances.
- Maintain version control systems, particularly Git, and collaborate with cross-functional teams using collaboration platforms.
- Apply knowledge of infrastructure-as-code (IaC) principles and tools, including Ansible and Terraform, to manage and scale our infrastructure.
- Demonstrate strong problem-solving skills to troubleshoot complex issues efficiently.
- Leverage your background in system administration, infrastructure management, or software engineering to optimize our systems.
- Actively participate in incident response and post-incident analysis, and take preventive measures to ensure system stability and reliability.
- Implement observability tools, monitoring, and alerting systems to proactively identify and resolve potential issues.
- Strike a balance between ensuring system reliability and promoting continuous innovation and development.