Description

JD
·        The ideal candidate will have a strong background in production monitoring, a deep understanding of development and operations, and a proven track record in managing and scaling distributed systems in public, private, or hybrid cloud environments.
·        Understanding of SRE principles, including monitoring, alerting, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements.
·        Expertise in infrastructure as code (IaC), configuration management, build automation, source control, and CI/CD tools (e.g., Terraform, CloudFormation, Ansible, GitHub, Artifactory, Jenkins).
·        Deep understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
·        Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Dynatrace, Splunk) and incident response processes.
·        Proficiency in Java, .NET, web UI/JavaScript frameworks, and scripting languages such as Python, Bash, and PowerShell.
·        High-level understanding of the different layers of the tech stack and how they come together to provide a service (e.g., network, compute, storage, operating systems such as Linux and Windows, supporting services, and the application layer).

Responsibilities:
·        Key measures of success will include platform stability, effective integration and delivery, instrumentation, release quality, technical debt (toil) reduction, development of automation, risk/security compliance, and sustained advancement of the SRE practice.
·        Design and implement scalable, automated, monitored, and well-documented systems to accelerate the development of services running in the AWS and Azure clouds.
·        Configure, tune, and fix multi-tiered systems to achieve optimal application performance, stability, and availability.
·        Be part of an on-call rotation providing hands-on technical expertise during service-impacting events.
·        Apply troubleshooting skills and debugging tools, and examine logs, telemetry, and other data sources, to verify assumptions and assess customer impact. Lead blameless postmortems to identify root causes and improve production resiliency.
 

Education

Any Graduate