Description

Key Responsibilities:

Cloud Infrastructure Management:

  • Manage, monitor, and optimize cloud infrastructure across platforms (e.g., AWS, Azure).
  • Ensure high availability, scalability, and cost-efficiency of cloud systems.
  • Oversee deployment and maintenance of applications and services in the cloud environment.


Monitoring Expertise:

  • Design and implement advanced monitoring, alerting, and observability solutions using tools such as Datadog, Grafana, and Prometheus.
  • Configure dashboards, custom metrics, and anomaly detection to provide deep insights into system performance.
  • Conduct training sessions for the customer's team on effective Datadog usage.


Incident and Problem Management:

  • Take ownership of incident management, overseeing real-time detection, escalation, and rapid resolution of issues.
  • Perform root cause analysis and implement long-term solutions to prevent recurrence.
  • Develop and enforce operational playbooks for handling critical incidents.


Security and Compliance:

  • Ensure adherence to security best practices and compliance with customer and industry standards.
  • Collaborate with security teams to implement identity and access controls, encryption, and vulnerability management.


Reporting and Optimization:

  • Generate and present regular operational reports to customer stakeholders, including SLA adherence and performance metrics.
  • Analyze trends to identify areas for optimization and proactively recommend improvements.


Leadership and Collaboration:

  • Lead and mentor the CloudOps team to deliver top-notch operational performance.
  • Collaborate with development, QA, and security teams to align operations with business goals.
  • Present operational reports and insights, including SLA adherence and mobile application performance metrics.


Process Improvement and Automation:

  • Continuously analyze current processes and identify areas for improvement.
  • Implement automation tools and techniques to enhance efficiency.
  • Establish and document standard operating procedures (SOPs) for NOC operations.
  • Implement Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or Ansible.
  • Automate repetitive operational tasks to enhance efficiency.


Required Qualifications:

  • Bachelor's degree in Computer Science, IT, or a related field; a Master's degree is a plus.
  • 8+ years of experience in cloud operations, with at least 3 years in a leadership role.
  • Strong expertise in monitoring tools such as Datadog, Grafana, and Prometheus, including advanced configuration and monitoring setup.
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Hands-on experience with automation tools like Terraform, Ansible, or CloudFormation.
  • Solid understanding of DevOps practices, CI/CD pipelines, and container orchestration (e.g., Docker, Kubernetes).


Certifications:
Datadog certifications or proven expertise in the platform is a significant advantage.

Soft Skills:

  • Strong leadership and team management skills with the ability to work onsite in a customer-facing role.
  • Excellent communication and interpersonal skills for effective collaboration with stakeholders.
  • Proactive and solution-oriented mindset to drive improvements and resolve challenges.
  • Ability to work under pressure and manage multiple priorities effectively.


 

Education

Any Graduate