Senior-level SRE responsible for ensuring reliability, performance, and scalability of GCP-based
platforms supporting a global cloud environment. Focus on automation, observability, and
incident response for mission-critical applications.
• Minimum: 5+ years in Site Reliability Engineering or Platform Engineering
• Preferred: 7+ years with enterprise-scale cloud environments
• Industry: Experience in high-availability, customer-facing systems preferred
• Advanced monitoring and observability (Prometheus, Grafana, New Relic, Datadog)
• Incident management and post-mortem analysis
• SLI/SLO definition and measurement
• Chaos engineering and reliability testing
• Performance tuning and capacity planning
• Automation and scripting (Python, Go, Bash)
• Infrastructure as Code (Terraform, Ansible)
• Container orchestration (Kubernetes, Docker)
• CI/CD pipeline design and implementation
• Microservices architecture and distributed systems
• Load balancing and traffic management
• Database performance optimization
• Compute: GCE, GKE, Cloud Run, App Engine
• Monitoring: Cloud Operations Suite (formerly Stackdriver), Cloud Logging, Cloud Monitoring
• Networking: VPC, Cloud Load Balancing, Cloud CDN
• Storage: Cloud Storage, Persistent Disks, Cloud SQL
• Security: IAM, VPC Security, Cloud KMS
• Cloud Trace and Cloud Profiler for APM
• Cloud Deployment Manager and Cloud Build
• Anthos for hybrid/multi-cloud management
• Error Reporting and Cloud Debugger
• BigQuery for log analysis and metrics
Any Graduate