We are seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our critical systems and infrastructure. In this role, you will apply a deep understanding of software engineering to operational problems, focusing on automation, resilience, and efficiency in a fast-paced, cloud-native environment.
Key Responsibilities:
- Design, implement, and maintain highly available, scalable, and secure cloud-based solutions primarily within AWS.
- Develop and enhance automation tools and scripts using a common programming or scripting language (e.g., Python, JavaScript, Go, Ruby) to streamline operational workflows, reduce toil, and improve system reliability.
- Implement and manage infrastructure as code (IaC) leveraging Terraform to consistently provision, configure, and manage our cloud infrastructure.
- Build, optimize, and maintain robust CI/CD pipelines to ensure efficient, automated, and reliable software delivery from development to production.
- Proactively monitor system performance, identify potential bottlenecks, and implement solutions to prevent outages and optimize resource utilization.
- Participate in on-call rotations to respond to critical incidents, perform root cause analysis, and implement preventative measures.
- Collaborate closely with development, product, and security teams to embed reliability principles throughout the entire software development lifecycle.
- Drive continuous improvement by identifying operational gaps and designing automated solutions.
- Troubleshoot complex production issues across distributed systems, leveraging monitoring and logging tools.
Required Qualifications:
- 5+ years of progressive experience in a Site Reliability Engineering (SRE), DevOps, Cloud Operations, or similar role, with a strong focus on system reliability and automation.
- Proven expertise in designing, deploying, and managing highly available cloud-based solutions, with extensive hands-on experience in AWS services (e.g., EC2, S3, RDS, Lambda, VPC, ECS/EKS).
- Strong proficiency in at least one common programming or scripting language (e.g., Python, JavaScript, Go, Ruby) for automation, tooling, and system interaction.
- Demonstrable experience with Infrastructure as Code (IaC) principles and significant hands-on experience with Terraform for managing cloud resources.
- Solid experience building, maintaining, and optimizing CI/CD pipelines (e.g., Jenkins, GitLab CI, AWS CodePipeline, CircleCI) to facilitate rapid and reliable deployments.
- Experience with monitoring, logging, and alerting tools (e.g., Splunk).
- Exceptional problem-solving skills with a methodical approach to troubleshooting complex production issues.
- Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.