- Client needs a Site Reliability Engineer (SRE) who knows how to balance going fast and going big with operating safely.
- Our mission is to progress, protect, and provide for the software and systems behind all of the client’s public services (Analytics, Campaign, Data Platform, Dynamic Form Services, Advertising Cloud, Primetime, and Target, to name just a few), with an ever-watchful eye on their availability, latency, performance, and capacity.
- SRE is a mindset and a set of engineering approaches focused on building highly reliable systems and eliminating manual work through automation.
- We hire people from both systems and software backgrounds. Strong candidates will have experience with both.
- The engineer role within SRE is at the heart of fulfilling SRE’s mission: build a highly reliable, scalable, and measurable customer experience that supports the continued growth of the client’s infrastructure.
- For this position, exceptional critical-thinking, problem-solving, and in-depth technical skills are necessary.
- A very good balance of process-oriented thinking skills and experience in managing customer expectations is a must.
- The successful candidate must be able to function at a high level in critical situations.
Essential Job Functions:
- Engage with product and engineering teams from day one to proactively design, build, and maintain systems and software for high availability.
- Write software layers, scripts, deployment frameworks, tracers, monitors, and self-healing/auto-remediation tools, and automate processes.
- Build and maintain software modules for use and re-use in cloud systems automation.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and free of manual intervention.
- When extremely complex issues arise despite the self-healing and automation you have built, take part in troubleshooting and root-cause analysis across the stack: hardware, software, database, network, and so on.
- Participate in a shared on-call schedule (follow-the-sun model) managed across SRE and Engineering.
Qualifications:
- Excellent ) and automation skills.
- Experience working with databases (PostgreSQL, etc.).
- Experience working on systems developed using Java, Python, and shell scripting.
- Troubleshooting and system engineering exposure in UNIX/RHEL production environments.
- Solid experience with Linux, Internet protocols, and large-scale operations.
- Experience developing, running, and/or consuming cloud technologies such as AWS, Azure, OpenStack, or Google Cloud Platform.
- Ability to work independently and own problem statements end-to-end.
- Great communication, interpersonal and teamwork skills.
Bonus Skills
- Experience designing for and dealing with a large production environment.
- Experience with container and cloud technologies such as Docker and Kubernetes.
- Recent large-scale experience with configuration management using tools such as SaltStack, Ansible, Chef, or Puppet.
Technical Expertise
- Linux Administration
- RHCE/RHCA
- User Management
- File System & Package Management
- Cloud –
Production Experience / Strong Troubleshooting
- DNS
- ISO/OSI stack
- Troubleshooting
- Web-based applications
- Performance
- Network
- SSH/SSL/SFTP etc.
- Security
Development/Scripting (Good to have)
- Languages (at least one)
- Config Management
Education
- Graduate (B.Tech) in Computer Science
- 4-10 years of relevant work experience