We are seeking an experienced Site Reliability Engineer (SRE) with a proven track record in managing the reliability, scalability, and performance of services. This role will involve working in a dynamic environment and collaborating with cross-functional teams to ensure the stability of our infrastructure and applications.
Key Responsibilities:
AWS Infrastructure Management:
- Design, implement, and maintain scalable cloud infrastructure using AWS services to ensure high availability and performance.
- Monitor AWS resources and utilization to optimize cost and performance efficiency.
CI/CD Pipeline Development:
- Develop and maintain continuous integration and continuous deployment (CI/CD) pipelines using Jenkins and AWS tools.
- Automate deployment processes and facilitate smooth transitions from development to production environments.
Incident Management and Prevention:
- Lead incident response efforts, coordinate with DevOps teams to quickly resolve production issues, and minimize downtime.
- Conduct post-incident reviews to identify root causes and recommend improvements to prevent future incidents.
Distributed Systems Expertise:
- Apply strong knowledge of distributed systems to maintain robust networking configurations and operating systems.
- Troubleshoot complex systems and ensure that all components are functioning correctly in a distributed architecture.
Automation and Configuration Management:
- Utilize automation tools to streamline operations and configuration management processes across environments.
- Implement best practices for system automation to improve efficiency and reliability.
Scripting and Development:
- Write and maintain scripts in Python and other languages as required to automate tasks and enhance system operations.
- Collaborate with software development teams to contribute to application code where necessary, ensuring reliability and performance are considered.
Cross-Functional Collaboration:
- Work closely with development, QA, and operations teams to foster a culture of collaboration and shared responsibility for reliability, in which every team member feels valued and part of the team.
- Communicate effectively using clear technical documentation and status updates to stakeholders and team members.
Technical Skills Required:
Programming Languages:
- PHP, Python, Perl, Ruby, Java, and C++, with the ability to develop and modify code that supports system functionality.
Cloud Technologies:
- Strong knowledge of private cloud technologies, including OpenStack, OpenNebula, and Apache CloudStack. Experience in setting up and managing private cloud environments.
Virtualization and Containers:
- Familiarity with virtual machine platforms such as VMware's vSphere and Linux KVM, as well as container solutions like Docker, OpenVZ, and Cloud Foundry.
Monitoring and Observability:
- Experience using monitoring and performance management tools such as Datadog, AppDynamics, Stackify, SolarWinds, and Dynatrace to track system performance and detect anomalies.
Education:
- Bachelor's degree in Computer Science.