1. Python Development for Automation & Reliability
- Design, develop, and maintain Python-based automation tools and scripts to enhance system reliability.
- Build APIs, microservices, and infrastructure automation solutions.
- Write efficient, scalable, and maintainable Python code for operations and monitoring.
2. Site Reliability Engineering (SRE) & System Performance Optimization
- Ensure the availability, scalability, and resilience of IT infrastructure and services.
- Implement self-healing mechanisms and proactive monitoring for system stability.
- Optimize application performance, database queries, and system resources.
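As an illustration of the self-healing work described above, the following is a minimal sketch, not a prescribed implementation. The function name `heal_if_unhealthy` and its `check`, `restart`, and `max_restarts` parameters are hypothetical; in practice the health check and restart action would wrap whatever service manager or orchestrator the team uses.

```python
from typing import Callable


def heal_if_unhealthy(
    check: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> bool:
    """Run a health check and restart the service until it passes.

    Hypothetical helper: `check` returns True when the service is healthy,
    and `restart` performs one remediation attempt. Returns True if the
    service is healthy at the end, False if it is still failing after
    `max_restarts` attempts (at which point a human should be paged).
    """
    for _ in range(max_restarts):
        if check():
            return True
        restart()
    # One last check after the final restart attempt.
    return check()
```

Capping the number of restarts keeps the automation from looping forever on a fault it cannot fix and gives a clear point at which to escalate to on-call.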
3. Infrastructure as Code (IaC) & Cloud Automation
- Implement Infrastructure as Code (IaC) using Terraform, Ansible, or CloudFormation.
- Automate cloud infrastructure provisioning on AWS, GCP, or Azure.
- Manage and optimize containerized workloads using Docker and Kubernetes.
4. CI/CD & Deployment Management
- Develop and maintain CI/CD pipelines to automate deployment processes.
- Ensure zero-downtime deployments and improve rollback strategies.
- Collaborate with DevOps teams to enhance software release processes.
5. Observability, Monitoring & Incident Management
- Implement logging, monitoring, and alerting solutions (Prometheus, Grafana, ELK, Datadog).
- Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Participate in on-call rotations and resolve production incidents efficiently.
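For candidates less familiar with SLIs and SLOs, the arithmetic behind them can be sketched as follows. This is a simplified example assuming a request-based availability SLI; the function names and the sample numbers are illustrative assumptions, not a description of any specific production setup.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        return 1.0  # No traffic, so nothing has failed.
    return 1.0 - failed_requests / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    `slo_target` is the SLO as a fraction, e.g. 0.999 for "99.9%".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0
    # The SLO permits this many failed requests in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers: 1,000,000 requests, 250 failures, 99.9% SLO.
sli = availability_sli(1_000_000, 250)
budget = error_budget_remaining(0.999, 1_000_000, 250)
print(f"SLI: {sli:.5f}, error budget remaining: {budget:.0%}")
```

With these sample numbers the SLO allows about 1,000 failures, so 250 failures leaves roughly three quarters of the error budget unspent.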
6. Security & Compliance
- Implement automated security scanning and compliance checks.
- Enforce secure coding practices and conduct regular vulnerability assessments.
- Collaborate with security teams to improve system hardening and access controls.
Required Skills & Qualifications:
Technical Skills:
- Strong proficiency in Python for automation, scripting, and backend development.
- Experience with Linux/Unix system administration and troubleshooting.
- Hands-on experience with cloud platforms (AWS, GCP, or Azure).
- Proficiency in CI/CD tools (Jenkins, GitLab CI/CD, ArgoCD).
- Knowledge of Kubernetes and Docker for containerized deployments.
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
- Expertise in Infrastructure as Code (IaC) using Terraform, Ansible, or CloudFormation.
- Understanding of networking, DNS, load balancing, and security best practices.
Soft Skills:
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration skills.
- Passion for automation, DevOps, and reliability engineering.