We are seeking a highly skilled Site Reliability and Operations Engineer (SRE) with a robust background in Kubernetes-based distributed caching and compute grid systems. The ideal candidate will possess a solid blend of infrastructure engineering and software development skills. This role will focus on the design, implementation, and maintenance of high-performance distributed platforms to ensure high availability, scalability, and system observability.
Job Responsibilities:
Development & Implementation:
Design, build, and enhance distributed caching and compute grid solutions on Kubernetes/OpenShift platforms.
Leverage technologies such as IBM Spectrum Symphony, TIBCO GridServer, or similar for high-throughput compute grids.
Use container tooling (Docker, Helm) to package, deploy, and orchestrate microservices and container workloads.
Apply parallel compute strategies and optimize load balancing for application performance.
Site Reliability Engineering (SRE):
Ensure platform reliability, scalability, and minimal downtime by maintaining robust distributed systems.
Implement and maintain observability and monitoring using Prometheus, Grafana, ELK, or OpenTelemetry.
Automate infrastructure provisioning and deployments using Ansible, Helm Charts, and similar tools.
Troubleshoot complex system and infrastructure issues in Kubernetes environments.
Support CI/CD processes using tools like Jenkins, Argo CD, and GitHub Actions.
Required Skills & Qualifications:
Preferred Qualifications:
Any Graduate