Description

We are seeking an experienced DevOps Engineer to join our enterprise operations team. This role is critical to ensuring the availability, scalability, and performance of our monitoring and alerting systems. The ideal candidate will be an AWS-savvy engineer with deep expertise in observability tools like Datadog, event management platforms such as BigPanda, and best practices for infrastructure automation in cloud environments. You will work closely with development, infrastructure, and problem management teams to ensure high system uptime, precise alerting, and visibility across platforms.

Key Responsibilities

  • Manage and maintain enterprise monitoring systems, including configuration of alert templates and integration with BigPanda.
  • Ensure alert quality, CMDB integration, and adherence to AWS monitoring best practices.
  • Serve as the administrator for APM and observability tools (Datadog, BigPanda).
  • Create and maintain logging and indexing strategies to support development and operational visibility.
  • Develop and manage infrastructure configuration using CloudFormation and Serverless frameworks.
  • Collaborate with DevOps and other technical teams on escalations and root cause analysis.
  • Oversee uptime and availability reporting and dashboard development for operational insights.
  • Provide hands-on support during incidents and outages, and guide teams through event resolution.
  • Mentor and train support personnel on monitoring tools and operational processes.
  • Evaluate new technologies, participate in training, and stay current on emerging trends.
  • Perform other duties as assigned in support of global platform operations.

Required Skills & Qualifications

  • 10+ years of experience in Information Technology with a strong infrastructure background.
  • 4+ years of hands-on experience running production systems on AWS.
  • Proficiency with Datadog, including site and log monitoring, dashboard creation, and alert configuration.
  • Solid knowledge of AWS services such as DynamoDB, S3, Cognito, EC2, and CloudFormation.
  • Hands-on experience with JavaScript and TypeScript.
  • Familiarity with serverless architectures and container monitoring.
  • Experience with container technologies such as Docker and Kubernetes.
  • Working knowledge of Windows Server, IIS, and cloud-based infrastructure tools.
  • Experience with CI/CD pipelines and source control (Git, SVN, etc.).
  • Familiarity with Ansible or similar configuration management tools.
  • Strong troubleshooting skills across systems and networks.
  • Ability to create dashboards that report on cost and performance metrics.
  • Exceptional verbal and written communication skills across all organizational levels.
  • Strong collaboration, integrity, and technical ownership.

Education

Any Graduate