Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures
Participate in system design consulting, platform management, and capacity planning
Create sustainable systems and services through automation and uplifts
Balance feature development speed and reliability with well-defined service-level objectives
QUALIFICATIONS
Bachelor’s degree (or equivalent) in computer science or related discipline
5+ years’ experience with Java, J2EE, NoSQL/SQL datastores, Spring Boot, GCP/AWS/Azure, and Docker/Kubernetes in the development and maintenance of multi-tier applications.
Understanding of RESTful APIs and microservices platforms
4+ years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, Datadog, and PagerDuty.
Strong experience working with product and development teams to establish error budgets by identifying the right SLOs (service-level objectives), SLIs (service-level indicators), and KPIs (key performance indicators), and effectively driving the use of those budgets to ensure maximum domain availability/uptime.
Experience solving complex architecture, design, and business problems, and working to simplify, optimize, and remove bottlenecks.
Experience architecting, designing, and developing automation to reduce toil and improve the recoverability, availability, latency, and scalability of supported applications, with an understanding of MTTD (mean time to detection) and MTTR (mean time to resolution)
Ability to quickly diagnose and resolve issues in high-pressure situations.
Strong verbal and written communication skills to effectively collaborate with cross-functional teams and articulate technical concepts to non-technical stakeholders.
Experience in leading teams, mentoring junior staff, and promoting a culture of continuous improvement and learning.
Ability to analyze complex data to improve system performance and predict future challenges.
Experience in handling outages and the ability to lead incident response efforts, minimizing impact on services.
Understanding of network architecture, protocols, and security practices to ensure robust and secure systems.
Skills in, and understanding of, performance tuning and optimization of systems and applications.
Knowledge of database administration and management, particularly in configuring, managing, and scaling databases.
Experience in planning and executing disaster recovery strategies to ensure data integrity and availability