Site Reliability Engineering

Client needs a Site Reliability Engineer (SRE) who knows how to balance going fast and going big with operating safely.
Our mission is to progress, protect, and provide for the software and systems behind all of the client’s public services (Analytics, Campaign, Data Platform, Dynamic Form Services, Advertising Cloud, Primetime, and Target, to name just a few), with an ever-watchful eye on their availability, latency, performance, and capacity.
SRE is an engineering mindset that focuses on building highly reliable systems and eliminating work through automation.
We hire people from both systems and software backgrounds. Strong candidates will have experience with both.
The engineer role within SRE is at the heart of fulfilling SRE’s mission: to build a highly reliable, scalable, and measurable customer experience for the continued growth of the client’s infrastructure.
For this position, exceptional critical thinking, problem-solving, and in-depth technical skills are necessary.
A very good balance of process-oriented thinking skills and experience in managing customer expectations is a must.
The successful candidate must be able to function at a high level in critical situations.

Essential Job Functions:

Engage proactively with product and engineering teams from Day 1 to design, build, and maintain systems and software for high availability.
Write software layers, scripts, deployment frameworks, tracers, monitors, and self-healing/auto-remediation tools, and automate processes.
Build and maintain software modules for use and re-use in cloud systems automation.
Maintain business continuity by identifying and driving opportunities to make systems highly resilient and human-free.
When extremely complex issues arise despite the self-healing and automation you have built, get involved in troubleshooting and root-cause analysis across the stack: hardware, software, database, network, and so on.
Participate in a shared on-call schedule (follow-the-sun model) managed across SRE and Engineering.

Qualifications:

Excellent automation skills.
Experience working with databases (PostgreSQL, etc.).
Experience working on systems developed using Java, Python, and shell scripting.
Troubleshooting and system engineering exposure in UNIX/RHEL production environments.
Solid experience with Linux, internet protocols, and large-scale operations.
Experience developing, running, and/or consuming cloud technologies such as AWS, Azure, OpenStack, or Google Cloud Platform.
Ability to work independently and own problem statements end-to-end.
Great communication, interpersonal and teamwork skills.

Bonus Skills

Experience designing for and dealing with a large production environment.
Experience with container and cloud technologies such as Docker and Kubernetes.
Recent large-scale experience with configuration management using tools such as SaltStack, Ansible, Chef, or Puppet.

Technical Expertise

Production Experience / Strong Troubleshooting

DNS
ISO/OSI stack
Troubleshooting
- Web based application
- Performance
- Network
- SSH/SSL/SFTP etc.
Security

Development/Scripting (Good to have)

Education

Any Graduate
