Description

  • Programming & Scripting: Proficiency in Python, Go, Bash, or similar languages.
  • Cloud Platforms: Experience with AWS, GCP, or Azure.
  • Infrastructure as Code (IaC): Hands-on experience with Terraform, CloudFormation, or similar tools.
  • Configuration Management: Experience with Ansible, Puppet, or Chef.
  • Monitoring & Logging: Familiarity with tools like Prometheus, Grafana, ELK Stack, Datadog, New Relic, or Splunk.
  • CI/CD Pipelines: Experience with Jenkins, GitHub Actions, GitLab CI/CD, or ArgoCD.
  • Networking & Security: Understanding of firewalls, VPNs, load balancing, and network troubleshooting.
  • Containers & Orchestration: Experience with Docker and Kubernetes.
  • Database Management: Knowledge of SQL and NoSQL databases like MySQL, PostgreSQL, MongoDB, or Cassandra.

 

Soft Skills

  • Strong problem-solving and analytical skills.
  • Ability to work under pressure in a fast-paced environment.
  • Excellent communication and collaboration skills to work across teams.
  • Strong attention to detail and a proactive approach to identifying system weaknesses.

 

Key Responsibilities

  • System Reliability & Performance: Design, develop, and maintain highly available, scalable, and fault-tolerant systems. Implement observability tools, including monitoring, logging, and alerting solutions, to ensure system uptime. Conduct root cause analysis and post-mortems for incidents and outages to prevent recurrence.
  • Automation & Infrastructure as Code (IaC): Automate repetitive operational tasks using scripting and configuration management tools (e.g., Terraform, Ansible, Puppet, Chef). Develop CI/CD pipelines to streamline deployments and improve release management. Maintain and improve Infrastructure as Code (IaC) for cloud and on-premises environments.
  • Incident Management & Troubleshooting: Act as a first responder for critical system incidents, ensuring quick resolution and minimal downtime. Collaborate with engineering teams to diagnose and fix production issues. Participate in on-call rotations for 24/7 system monitoring and support.
  • Cloud & Infrastructure Management: Deploy, manage, and optimize cloud services (AWS, GCP, Azure) or on-premises infrastructure. Ensure cost-efficient cloud usage and scalability through resource management and auto-scaling strategies. Implement security best practices for system architecture and cloud environments.
  • Software Development & Performance Optimization: Develop internal tools and scripts to enhance operational efficiency. Optimize database queries, API performance, and infrastructure to reduce latency and improve system performance. Work closely with software engineers to embed reliability best practices into application design.
  • Capacity Planning & Disaster Recovery: Forecast system capacity needs and implement strategies for scaling applications. Design and test disaster recovery and failover strategies for business continuity. Ensure backup and recovery mechanisms are in place and regularly tested.

Education

Any Graduate