Description

Job Description:

Skill: Senior Site Reliability Engineer
6+ years of experience as a Site Reliability Engineer or in an equivalent role.
Proven experience in monitoring, analyzing, and optimizing the performance of large-scale distributed systems.
Track record of operating and supporting Kubernetes in production at scale (EKS preferred).
Expertise in Linux systems administration, including managing servers, operating systems, and network configurations.
Strong scripting and automation skills, preferably with experience in Bash, Python, or similar languages.
Familiarity with AWS.
Experience with DevOps tools and practices, such as GitLab CI/CD and Docker.
Excellent troubleshooting and problem-solving skills with a knack for identifying and resolving complex technical issues.
Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders.
A passion for maintaining high availability, performance, and reliability of critical systems in a fast-paced financial environment.
Responsibilities:

Availability:
Proactively monitor our systems and identify potential issues that could impact their availability.
Implement and maintain automated alerting mechanisms to notify the appropriate parties of potential outages or performance degradation.
Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime.
Latency:
Analyze performance metrics to identify and resolve latency bottlenecks in our infrastructure.
Implement performance optimization techniques and tools to improve the overall responsiveness of our systems.
Work with development teams to ensure that new features and code changes do not introduce performance regressions.
Performance:
Develop and maintain metrics dashboards to track key performance indicators (KPIs) for our critical systems.
Identify performance trends and anomalies that may indicate potential issues or areas for improvement.
Recommend and implement performance optimization strategies to enhance the overall efficiency of our systems.
Efficiency:
Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
Collaborate with development teams to optimize resource allocation for new applications and services.
Release Management:
Participate in the release planning process to ensure that software releases are conducted smoothly and without disruptions.
Develop and implement automated deployment and rollback procedures to mitigate risks associated with software updates.

Education

Any Graduate