Description

  • Client needs a Site Reliability Engineer (SRE) who knows how to balance going fast and going big with operating safely.
  • Our mission is to progress, protect, and provide for the software and systems behind all of the client's public services - Analytics, Campaign, Data Platform, Dynamic Form Services, Advertising Cloud, Primetime, and Target, to name just a few - with an ever-watchful eye on their availability, latency, performance, and capacity.
  • SRE is an engineering mindset that focuses on building highly reliable systems and eliminating work through automation.
  • We hire people from both systems and software backgrounds. Strong candidates will have experience with both.
  • The engineer role within SRE is at the heart of fulfilling SRE's mission: build a highly reliable, scalable, and measurable customer experience for the continued growth of the client's infrastructure.
  • For this position, exceptional critical thinking, problem-solving, and in-depth technical skills are necessary.
  • A very good balance of process-oriented thinking skills and experience in managing customer expectations is a must.
  • The successful candidate must be able to function at a high level in critical situations.

Essential Job Functions:

  • Engage proactively with product and engineering teams from Day 1 to design, build, and maintain systems and software for high availability.
  • Write software layers, scripts, deployment frameworks, tracers, monitors, and self-healing/auto-remediation tools, and automate processes.
  • Build and maintain software modules for use and re-use in cloud systems automation.
  • Maintain business continuity by identifying and driving opportunities to make systems highly resilient and free of manual intervention.
  • When extremely complex issues arise despite the self-healing and automation you have built, take part in troubleshooting and root-cause analysis across the stack: hardware, software, database, network, and so on.
  • Participate in a shared on-call schedule (follow-the-sun model) managed across SRE and Engineering.

Qualifications:

  • Excellent ) and automation skills.
  • Experience working with databases (PostgreSQL, etc.).
  • Experience working on systems developed using Java, Python, and shell scripting.
  • Troubleshooting and system engineering exposure in UNIX/RHEL production environments.
  • Good experience with Linux, Internet protocols, and large-scale operations.
  • Experience developing, running, and/or consuming cloud technologies such as AWS, Azure, OpenStack, and Google Cloud Platform.
  • Ability to work independently and own problem statements end-to-end.
  • Great communication, interpersonal and teamwork skills.

Bonus Skills

  • Experience designing for and dealing with a large production environment.
  • Experience with container and cloud technologies such as Docker and Kubernetes.
  • Recent large-scale experience with configuration management using tools such as SaltStack, Ansible, Chef, or Puppet.

Technical Expertise

  • Linux Administration
    • RHCE/RHCA
    • User Management
    • File System & Package Management
  • Cloud
    • AWS (at minimum)
    • Azure

Production Experience / Strong Troubleshooting

  • DNS
  • ISO/OSI stack
  • Troubleshooting
    • Web-based applications
    • Performance
    • Network
    • SSH/SSL/SFTP etc.
  • Security

Development/Scripting (Good to have)

  • Languages (at least one)
    • Python
    • Bash
    • Java
    • JS
  • Config Management
    • Ansible
    • Salt

Education

  • Graduate (B.Tech) in Computer Science
  • 4-10 years of relevant work experience

Education

Any Graduate