Description

We have an immediate need for SRE Cloud Engineers with expertise in infrastructure runbook automation, system troubleshooting, and monitoring/dashboard creation for cloud services and Kubernetes (K8s) clusters in Arlington, TX.

This role involves developing automated solutions for infrastructure management, creating alerting templates, and collaborating with application teams to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs). You will also be responsible for establishing and documenting High Availability (HA) and Disaster Recovery (DR) best practices to ensure system resilience.

The ideal candidate will have strong scripting and automation skills, experience designing complex systems and solutions, and a solid understanding of observability tools such as Splunk, Azure Monitor, and Grafana. You will work closely with cross-functional teams to enhance system reliability, create comprehensive monitoring frameworks, and ensure rapid incident response. Expertise in Azure cloud solutions and familiarity with large-scale distributed systems are critical for this role.

Successful candidates should be detail-oriented, capable of solving complex technical issues, and comfortable working in dynamic cloud environments. Your ability to drive automation, improve operational efficiency, and maintain best practices across cloud infrastructure will be essential.

Education

Any Graduate