As a Lead Site Reliability Engineer (SRE), you will leverage your extensive experience in SRE practices to
maintain and enhance the reliability, performance, and scalability of mission-critical systems. You will
play a crucial role in ensuring the continuous availability and optimal functioning of our services.
Key Responsibilities:
• Senior-Level SRE Expertise: Apply your deep understanding of SRE principles to lead efforts in
improving system reliability and operational efficiency.
• Incident Management: Provide expert-level support during incidents, ensuring swift resolution
with minimal service disruption. Lead post-incident reviews to drive continuous improvement.
• Monitoring & Alerting: Design, implement, and optimize monitoring, alerting, and incident
response processes. Ensure the effectiveness of these systems to proactively address potential
issues.
• Automation: Drive the automation of manual processes to enhance operational efficiency,
reduce human error, and increase overall system resilience.
• CI/CD Pipeline Management: Develop, maintain, and improve automated CI/CD pipelines using
tools such as GitLab CI/CD and Jenkins, ensuring seamless and reliable deployment processes.
• Cross-Functional Collaboration: Work closely with cross-functional teams to ensure the
reliability, performance, and scalability of our infrastructure. Foster a culture of collaboration
and knowledge sharing.
• Support Across Time Zones: Provide support across all U.S. time zones, with the flexibility to
work weekends, rotational shifts, and overtime as required to maintain service continuity.
Required Skills & Qualifications:
• Java Programming: Advanced proficiency in Java, with a deep understanding of contemporary
software development practices.
• Kubernetes & Containerization: Extensive hands-on experience with Kubernetes, including
containerization technologies like Docker and Kubernetes storage solutions such as Portworx.
• Linux/Unix Systems: Strong command of Linux/Unix operating systems and shell scripting
(Bash), with a focus on system reliability and automation.
• Functional & Declarative Programming: Proficiency in functional languages such as Haskell
and OCaml, and in logic programming languages such as Prolog.
• Scripting & Automation: Experience with Python or Go, particularly in the context of scripting
and automation tasks.
• Virtualization: In-depth knowledge of VMware and other virtualization platforms, with a focus
on optimizing virtual environments for reliability and performance.
• Streaming Technologies: Expertise with Kafka Stream Generator, ksqlDB, cluster federation, and
Spark Streams, including experience in managing and optimizing streaming data architectures.
• Service Mesh & Networking: Familiarity with Istio and Anthos Service Mesh, with the ability to
manage and optimize service meshes for complex environments.
• Performance Monitoring & Debugging: Proficiency in using eBPF (extended Berkeley Packet
Filter) for performance monitoring and debugging.
• Monitoring & Logging Tools: Experience with industry-standard monitoring and logging tools
such as Splunk, Prometheus, Datadog, and Kiali.
• Load Balancing: Familiarity with NGINX Controller and Seesaw for effective load balancing and
traffic management.
• Infrastructure-as-Code (IaC): Competence in using Terraform for managing cloud infrastructure,
ensuring consistency and scalability across environments.
Any Graduate