We are looking for a Site Reliability Engineer (SRE) with a strong foundation in Full Stack Development who has transitioned into an SRE role. The ideal candidate will have hands-on experience in software development, infrastructure automation, CI/CD pipelines, and cloud platforms. You will be responsible for ensuring system reliability, scalability, and performance while using your development skills to automate operational tasks.
Key Responsibilities:
Reliability & Scalability: Ensure high availability, reliability, and scalability of applications and infrastructure.
Automation & CI/CD: Design, develop, and maintain CI/CD pipelines to automate deployments and infrastructure provisioning.
Monitoring & Incident Management: Implement robust monitoring, logging, and alerting solutions. Troubleshoot and resolve production issues promptly.
Infrastructure as Code (IaC): Manage cloud infrastructure using Terraform, CloudFormation, or similar tools.
Performance Optimization: Identify performance bottlenecks in applications and infrastructure and implement fixes.
Collaboration: Work closely with development teams to integrate SRE best practices and improve system resiliency.
Security & Compliance: Ensure compliance with security standards and best practices in deployment and operations.
Required Skills & Experience:
Development Background: Experience as a Full Stack Developer with expertise in JavaScript (Node.js, React, Angular, or Vue.js) or Python/Java/Go.
SRE & DevOps Skills: Hands-on experience in Kubernetes, Docker, CI/CD (Jenkins/GitHub Actions/GitLab CI), and cloud platforms (AWS/Azure/GCP).
Infrastructure as Code (IaC): Proficiency in Terraform, Ansible, or CloudFormation.
Observability Tools: Experience with Prometheus, Grafana, ELK Stack, Datadog, or New Relic.
Database Knowledge: Familiarity with SQL and NoSQL databases like MySQL, PostgreSQL, MongoDB, or DynamoDB.
Networking & Security: Understanding of networking concepts, load balancing, firewalls, and security best practices.
Scripting & Automation: Strong scripting skills in Bash, Python, or Go to automate infrastructure tasks.
Incident Response: Ability to manage on-call rotations and incident resolution under pressure.
Education: Any graduate.