- Work with a variety of tools and technologies to ensure the reliability and performance of the Managed Services organization.
- Drive initiatives to optimize and secure infrastructure operations while contributing to scalability and continuous improvement efforts.
- Maintain and ensure operational excellence of systems and applications supporting Managed Services infrastructure.
- Troubleshoot and resolve issues in a fast-paced, distributed environment, focusing on root cause analysis and post-incident reviews.
- Build and maintain observability frameworks using tools like Prometheus, OpenTelemetry, and Dynatrace to ensure reliable monitoring of systems and applications.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and manage Error Budgets to balance reliability and innovation.
- Use automation tools like Ansible and scripting languages like Python to reduce operational toil and optimize processes.
- Manage and troubleshoot Kubernetes clusters and containerized environments, ensuring smooth service-to-service communication.
- Diagnose and resolve networking issues across OSI layers 1-3, including through packet capture analysis.
- Oversee PKI certificate management and utilize tools like HashiCorp Vault for secrets management.
- Collaborate with cross-functional teams to identify system improvements and resolve critical issues.
- Provide mentoring and guidance to junior engineers, ensuring knowledge sharing and professional growth.
Qualifications:
Core Technical Skills:
- Proficiency in Linux administration (system tuning, SSH, log analysis with grep and regular expressions).
- Strong understanding of networking protocols and troubleshooting (OSI layers 1-3).
- Hands-on experience with Kubernetes and container orchestration.
- Automation experience with Ansible and scripting proficiency in Python.
- Knowledge of PKI certificate management and HashiCorp Vault or similar tools for secrets management.
- Expertise in monitoring and observability tools like Prometheus, Grafana, or Dynatrace.
Reliability Engineering Skills:
- Experience defining and managing SLIs, SLOs, and Error Budgets.
- Proven ability to instrument and analyze system performance metrics.
- Deep familiarity with troubleshooting distributed systems and microservices architecture.
Soft Skills:
- Strong initiative and curiosity for deep-diving into complex issues.
- Excellent communication skills to convey solutions and collaborate across teams.
- Ability to prioritize and resolve competing demands in high-pressure situations.
Preferred:
- Experience with observability frameworks like OpenTelemetry.
- Familiarity with ITIL frameworks and best practices in incident and problem management.
- Background in enterprise environments, navigating corporate processes and bureaucracy.
- Security experience in areas like RBAC, least privilege principles, and secure software development lifecycle processes.
- Certifications in Kubernetes or relevant technologies.