- Lead complex initiatives to develop infrastructure to provide solutions for business applications
- Architect products to use infrastructure platforms effectively in a scalable, reliable manner
- Debug reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
- Make monitoring and alerting fire on symptoms and not on outages
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
- Have a desire to solve everyday challenges facing software engineers and automate their toil away
- Have an excellent ability to manage multiple tasks and expectations at once
- Participate in various projects intended to continually improve or upgrade the infrastructure
- Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
- Review and analyze high impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
- Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third-party vendors
- Design, code, test, debug, and document programs using Agile development practices
- Make decisions on technical designs and implementation plans, and identify project risks and resource requirements
- Direct the daily risk and control flow of operations, focusing on policies, procedures, and work standards to ensure success
- Recommend courses of action to maintain cost effectiveness and achieve results
- Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals
- Interact with customers and vendors
- Lead small to medium cross-organizational transformational efforts in the Platform space
- Provide expertise in Kafka brokers, ZooKeeper, Kafka Connect, Schema Registry, KSQL, REST Proxy, and Kafka Control Center
- Use automation and provisioning tools such as BladeLogic, Ansible, Chef, Jenkins, and GitLab
- Deliver results in less defined & constantly changing environments
- Communicate with a broad and diverse audience, including technology and business leaders, and simplify complex messages for consumption
- As an application support specialist, lead support functions and drive the execution and maturity of multiple application support services, including incident triage, root cause analysis, change evaluation/execution/validation, deployment management, and risk and vulnerability management. Work closely with development and infrastructure partners such as middleware, NAS, database, and network teams
- Partner with CIO technology partners and managers to influence and support innovation and the continued drive toward automation, with touchless operational sustainment as a design and architecture construct
- Sustain operations and reduce risk in the ecosystem by aggressively pursuing safety and soundness actions, including but not limited to vulnerability remediation, patching, end-of-life management, and resiliency
- Engage hands-on in all Production environment RunOps and DevOps support activities needed for the platform and applications
- Drive operational management through incident response, communication, and tracking, along with root cause identification and closure
- Manage and coordinate Production change requests and release management
- Provide operational continuity through the development, management, measurement, analysis, and reporting of key service-level metrics as required by management
- Sustain focus on driving continuous service improvements and innovation to design, implement, and ensure SLAs, KPIs, and OLAs for critical business processes, applications, and partner interfaces
- Regularly present Production performance, incidents, root causes, preventative actions, and trend analysis to technical and business management teams
- Maintain and update all Production-related documentation (e.g., game plans, run books, procedures, processes)
- Ensure effective Production systems monitoring, alarming, and notification response and maintenance
- Provide general oversight and direction to virtual teams
Required Qualifications, US:
- 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 5+ years of experience troubleshooting environments across the entire architecture (i.e., applications to infrastructure)
- 3+ years of hands-on Linux administration experience
Desired Qualifications:
- 1+ years of experience in Artificial Intelligence, Natural Language Processing, Machine Learning, Distributed Computing, Chatbot, and Virtual Assistant
- 1+ years of experience supporting and monitoring Apache Flink solutions for real-time data processing
- 1+ years of experience supporting and monitoring service load balancing architectures, including F5 and VMware AVI
- 1+ years of experience with Big Data or Hadoop tools such as Spark, Hive, Kafka, and Map
- Cloud Architect or Engineer Certification (e.g., GCP, Azure, AWS)
- A BS/BA degree or higher in information technology
- Competent working in one or more environments highly integrated with an operating system.
- Have experience with VMware Pivotal Cloud Foundry (PCF) and Tanzu Application Service (TAS) technologies
- Have experience with Docker, OpenShift Container Platform (OCP), Kubernetes, Terraform, or similar IaC technologies
- Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
- Experience implementing and administering/managing technical solutions in major, large-scale system implementations.
- High critical thinking skills to evaluate alternatives and present solutions that are consistent with business objectives and strategy.
- Ability to lead projects/initiatives with high risk and complexity
- Ability to manage to production goals/SLAs/SLOs/KPIs, deadlines, and operational metrics
- Ability to manage tasks independently and take ownership of responsibilities
- Ability to learn from mistakes and apply constructive feedback to improve performance
- Ability to adapt to a rapidly changing environment.
- Proven leadership abilities including effective knowledge sharing, conflict resolution, facilitation of open discussions, fairness and displaying appropriate levels of assertiveness.
- Ability to communicate highly complex technical information clearly and articulately for all levels and audiences.
- Willingness to learn new technologies and tools and to train your peers.
- Ability to identify root-cause issues, articulate improvement opportunities, and design approaches/programs/products to improve overall quality assurance
- Strong knowledge of monitoring tools and their application (Glassbox, AppDynamics, Splunk, BigPanda AIOps, etc.)
- Understanding of system performance and how load drives utilization and customer experiences.
- Experience with Business Continuity Planning and Disaster Recovery, Application Resiliency/Highly Available Architecture, Site Resiliency
- Knowledge and understanding of Conversational Artificial Intelligence, Machine Learning, Deep Learning, Linear Regression, Models