Manage and ensure the reliability and operations of large-scale, high-performance applications in hybrid (on-prem and cloud) environments; 3-5 years’ experience required.
Develop automation scripts and build dashboards for Application Performance Management, focusing on transaction journey tracking.
Program using languages such as Go, Python, Java, or Rust (2-4 years’ experience required).
Work with databases such as Oracle, PL/SQL, SQL Server, Redis, ClickHouse, Postgres, MongoDB, or time-series databases.
Transition platforms to, and manage them on, cloud services (GCP, AWS, Rancher/CloudFormation/Azure/OpenShift) and maintain containerized applications (GKE/RKE/AKS); at least 2 years’ experience required.
Implement and maintain cloud observability using OpenTelemetry (OTel) for real-time monitoring, distributed tracing, and incident resolution.
Utilize GraphQL frameworks (Apollo, Prisma, Hasura) for application development and troubleshooting.
Troubleshoot networking issues (TCP/IP, HTTP, DNS, Load Balancing, Service Mesh) under high-pressure situations.
Ensure 24x7 application availability, develop solutions for repetitive tasks, and improve detection/gating for critical applications.
Use monitoring tools (Splunk, AppDynamics, Grafana/Prometheus, Dynatrace) to manage application health.
Participate in CI/CD processes, leveraging tools such as Rally, Confluence, and extenders.
Implement and manage in-memory caching solutions, especially Redis.
Debug across integrated technical platforms, including API gateways.
Work with cloud data services (GCS, Cloud SQL, PL/SQL, Spanner).
Monitor and troubleshoot HashiCorp Vault environments to minimize downtime and ensure rapid incident recovery.
Apply working knowledge of Vertex AI, Gen AI, and BigQuery.
Communicate clearly and effectively with technical and non-technical stakeholders.