Job Description:
- Monitoring and Alerting: Implement and maintain monitoring systems to proactively identify potential issues and alert engineers to problems before they impact users.
- Incident Response: Respond to incidents and outages, diagnose problems, and implement solutions to minimize downtime and restore service.
- Automation: Automate repetitive tasks and processes to improve efficiency and reduce manual effort.
- Performance Optimization: Identify and address performance bottlenecks to keep systems running efficiently.
- Infrastructure Management: Manage and maintain the underlying infrastructure, including servers, networks, and cloud resources.
- Capacity Planning: Plan for future capacity needs to ensure systems can handle anticipated workloads.
- Release Engineering: Develop and maintain processes for deploying software updates and releases.
- Collaboration: Work closely with developers, operations teams, and other stakeholders to ensure system reliability and availability.
- Documentation: Maintain clear and concise documentation of systems, processes, and procedures.
- Continuous Improvement: Identify areas for improvement and implement changes to enhance system reliability and performance.
Skills and Qualifications:
- Cloud Platforms (AWS, Microsoft Azure).
- Automation (DevOps, CI/CD, Terraform).
- Operating Systems (Windows, Linux).
- Scripting (Shell, Python, PowerShell).
- Databases (MySQL, Oracle, SQL database management).
- Application Deployment (WildFly, JBoss, Apache Tomcat).
- Container Services (Kubernetes, Docker, Helm).