Description

Needs:
OpenShift
Kubernetes
Development Experience (Java, Python, Golang)
SRE Skills

Nice to Haves:
Baremetal
Cloud

Job Description:
We are looking for a highly skilled Site Reliability and Operations Engineer (SRE) with extensive experience in Kubernetes-based distributed caching and compute grid solutions. This role requires a strong foundation in software development, infrastructure automation, and reliability engineering. You will be responsible for designing, implementing, and maintaining high-performance distributed systems that are reliable, scalable, and efficient.

Development & Implementation:
• Design, develop, and optimize distributed caching and compute grid solutions on Kubernetes/OpenShift.
• Apply an understanding of microservices and containerized workloads using Kubernetes, Docker, and Helm.
• Implement high-throughput compute grid solutions using IBM Spectrum Symphony, TIBCO GridServer, or similar technologies.
• Optimize application performance by leveraging parallel compute strategies, load balancing, and efficient data distribution.

Site Reliability Engineering (SRE):
• Ensure high availability, scalability, and reliability of distributed systems.
• Implement observability, logging, and monitoring using tools like Prometheus, Grafana, ELK, or OpenTelemetry.
• Automate infrastructure provisioning and deployments using Ansible and Helm charts.
• Apply an understanding of CI/CD pipelines for seamless software deployment.
• Troubleshoot and resolve incidents across the platform, infrastructure, and distributed compute layers, ensuring minimal downtime.

Required Skills & Qualifications:
• Strong experience in Kubernetes (OpenShift and on-prem/cloud clusters).
• Understanding of programming languages like Java, Go, or Python (this will be the deciding factor between the L4 and L5 levels).
• Experience with containerization and packaging technologies (Docker, Helm, etc.).
• Strong knowledge of CI/CD pipelines (Jenkins, ArgoCD, GitHub Actions).
• Hands-on experience with observability tools (Prometheus, Grafana, Loki, Jaeger).
• Understanding of networking, service meshes (Istio/Linkerd), and security best practices in Kubernetes.
• Experience with multi-cluster and hybrid cloud Kubernetes deployments.

Education

Any Graduate