Serve as Lead Systems Engineer specializing in Datadog within AWS cloud environments.
Oversee backend administration, engineering support, design, and implementation of Datadog-based monitoring and observability solutions.
Ensure high availability, performance, and reliability of cloud-based services and infrastructure.
Architect, deploy, and manage AWS resources (EC2, RDS, Lambda, ECS/EKS, S3, etc.) following best practices for scalability, security, and cost optimization.
Define and enforce Datadog monitoring strategies and best practices, including dashboards, alerts, and custom metrics.
Lead integration of Datadog with ServiceNow for automated incident management, event correlation, and CMDB synchronization.
Provide technical leadership, mentorship, and guidance to junior engineers and cross-functional teams.
Collaborate with IT and engineering teams to integrate monitoring and logging into CI/CD pipelines and cloud infrastructure.
Drive continuous improvement in system reliability (including SLO/SLI definitions, synthetic monitoring, and anomaly detection).
Contribute to and enforce Infrastructure as Code (IaC) standards using tools like Terraform or CloudFormation.
Participate in high-severity incident management, root cause analysis, and implementation of corrective actions.
Work remotely, with occasional meetings possible; must be located in an approved state.
Report to and support health insurance customers; must have relevant industry experience.