Description

Key Responsibilities:

  • Monitor live production environments (e.g., servers, databases, APIs).
  • Troubleshoot system outages, latency issues, or software bugs in real time.
  • Work with engineering and DevOps teams to escalate or resolve critical incidents (e.g., SEVs).
  • Deploy or assist with deployments, patches, or hotfixes.
  • Analyze logs, metrics, and monitoring dashboards (e.g., Datadog, Splunk, Grafana).
  • May be on call for after-hours incidents.

Skills Expected:

  • Understanding of system architecture (web apps, cloud services, etc.).
  • Familiarity with scripting (Bash, Python) or querying (SQL, log tools).
  • Comfort with tools like Jira, Jenkins, Git, or AWS.
  • Experience reading and interpreting stack traces.
  • Ability to triage incidents under pressure.

Typical Environment:

  • Internal teams at SaaS companies, fintechs, banks, etc.
  • Often part of an SRE, DevOps, or IT operations group.

Education

Any Graduate