Description

We are seeking an experienced Site Reliability Engineer (SRE) with a proven track record in managing the reliability, scalability, and performance of services. This role will involve working in a dynamic environment and collaborating with cross-functional teams to ensure the stability of our infrastructure and applications.

 

Key Responsibilities:

 

AWS Infrastructure Management:

- Design, implement, and maintain scalable cloud infrastructure using AWS services to ensure high availability and performance.

- Monitor AWS resources and utilization to optimize cost and performance efficiency.

 

CI/CD Pipeline Development:

- Develop and maintain continuous integration and continuous deployment (CI/CD) pipelines using Jenkins and AWS tools.

- Automate deployment processes and facilitate smooth transitions from development to production environments.

 

Incident Management and Prevention:

- Lead incident response efforts, coordinate with DevOps teams to quickly resolve production issues, and minimize downtime.

- Conduct post-incident reviews to identify root causes and recommend improvements to prevent future incidents.

 

Distributed Systems Expertise:

- Apply strong knowledge of distributed systems to maintain robust networking configurations and operating systems.

- Troubleshoot complex systems and ensure that all components are functioning correctly in a distributed architecture.

 

Automation and Configuration Management:

- Utilize automation tools to streamline operations and configuration management processes across environments.

- Implement best practices for system automation to improve efficiency and reliability.

 

Scripting and Development:

- Write and maintain scripts in Python and other languages as required to automate tasks and enhance system operations.

- Collaborate with software development teams to contribute to application code where necessary, ensuring reliability and performance are considered.

 

Cross-Functional Collaboration:

- Work closely with various teams, including development, QA, and operations, to foster a culture of collaboration and shared responsibility for reliability.

- Communicate effectively using clear technical documentation and status updates to stakeholders and team members.

 

Technical Skills Required

 

Programming Languages

- Proficiency in PHP, Python, Perl, Ruby, Java, and C++, with the ability to develop and modify code that supports system functionality.

 

Cloud Technologies

- Strong knowledge of private cloud technologies, including OpenStack, OpenNebula, and Apache CloudStack. Experience in setting up and managing private cloud environments.

 

Virtualization and Containers

- Familiarity with virtual machine platforms such as VMware vSphere and Linux KVM, as well as container and platform solutions like Docker, OpenVZ, and Cloud Foundry.

 

Monitoring and Observability

- Experience using monitoring and performance management tools like Datadog, AppDynamics, Stackify, SolarWinds, and Dynatrace to track system performance and detect anomalies.

Education

- Bachelor's degree in Computer Science.