Must Have:
Strong AWS (ECS OR EC2 OR Lambda OR IAM OR CloudFormation) + (EKS OR Kubernetes OR EKS clusters) + (IaC with Terraform preferred) + (Support OR Ticketing Systems OR handled a large number of tickets through any platform)
Key Responsibilities:
- Deliver incident management and advanced-level L1/L2 support for internal applications across public cloud platforms, with a strong emphasis on AWS.
- Serve as the initial point of contact for application developers via a ticketing system.
- Communicate effectively with users at various organizational levels.
- Implement and utilize automation to support the scalability of the environment.
- Optimize operational processes to enhance efficiency, reliability, and security.
- Train users to self-diagnose and troubleshoot issues for expedited resolution.
- Conduct thorough investigations into issues to identify root causes and document strategies to prevent recurrence.
- Provide support for public cloud environments, particularly AWS.
- Manage events and incidents efficiently.
- Develop and implement scalable automation processes to handle tasks in a large-scale environment.
- Analyze and debug incidents, following up to gather feedback and prevent future issues.
- Support different development environments, including Unix, Linux, Mainframe, and Windows.
Required Skills and Experience:
- Proficiency in the software development life cycle (SDLC), with the ability to read Java and Python code.
- Hands-on scripting experience (Unix shell, Python).
- Extensive cloud experience, particularly with AWS.
- Expertise in Kubernetes.
- Strong troubleshooting and diagnostic skills for security and access issues in a large enterprise environment.
- Database administration skills (Oracle, Cassandra, CockroachDB), including performance tuning, connectivity, backups, indexes, and monitoring alarms.
- Middleware and messaging experience (Kafka, MQ).
- Experience with Tomcat.
- System engineering and administration skills (Unix/Linux).
- Familiarity with monitoring tools and ticketing systems.
- Commitment to automating processes for continuous improvement.
- Excellent communication skills.
- Ability to analyze details, understand incident causation, and implement preventive measures to ensure reliability and security.