Description

Position Overview
 

We are seeking a Cloud Operations Lead to manage and optimize cloud infrastructure operations across AWS, Azure, and GCP. The ideal candidate will have expertise in cloud administration, automation, incident management, and performance optimization to ensure high availability, security, and efficiency of cloud environments.
 

Key Responsibilities
 

Cloud Operations & Management
 

  • Oversee day-to-day cloud operations, ensuring optimal performance, security, and cost efficiency.
     
  • Manage multi-cloud environments (AWS, Azure, GCP) for compute, storage, networking, and security operations.
     
  • Implement monitoring, alerting, and logging using tools like CloudWatch, Azure Monitor, and Google Cloud Operations Suite (formerly Stackdriver).
     
  • Ensure high availability, disaster recovery, and business continuity across cloud platforms.
     

Automation & Optimization
 

  • Develop and implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or ARM templates.
     
  • Automate cloud infrastructure deployment, scaling, and maintenance using scripting (Python, PowerShell, Bash).
     
  • Optimize cloud costs by implementing cost governance and acting on rightsizing recommendations.
     

Incident & Security Management
 

  • Lead incident response, troubleshooting, and root cause analysis for cloud-related issues.
     
  • Implement security best practices, identity and access management (IAM), and compliance controls across cloud platforms.
     
  • Collaborate with DevOps and security teams to enforce policies and remediate vulnerabilities.
     

Collaboration & Leadership
 

  • Work closely with engineering, DevOps, and IT teams to ensure smooth cloud operations.
     
  • Define SLAs, SLOs, and KPIs for cloud service availability and reliability.
     
  • Mentor junior cloud engineers and drive cloud operational excellence.
     

Required Skills & Qualifications
 

Technical Skills
 

  • Expertise in AWS, Azure, and GCP cloud operations.
     
  • Strong knowledge of compute, networking, storage, security, and IAM across cloud platforms.
     
  • Experience with monitoring and observability tools (Prometheus, Grafana, CloudWatch, Azure Monitor, Google Cloud Operations Suite, formerly Stackdriver).
     
  • Proficiency in automation and IaC tools (Terraform, CloudFormation, Ansible).
     
  • Strong scripting skills in Python, Bash, and PowerShell.
     
  • Hands-on experience with Kubernetes (EKS, AKS, GKE) and containerized workloads.
     

Experience
 

  • 5 to 8 years of experience in cloud operations, site reliability engineering (SRE), or cloud infrastructure management.
     
  • Proven track record in managing large-scale, multi-cloud environments.
     

Soft Skills
 

  • Strong problem-solving and incident management skills.
     
  • Ability to collaborate with cross-functional teams in a fast-paced environment.
     
  • Excellent documentation and reporting skills.

Education

Bachelor's degree in any discipline