Description

    • Lead complex initiatives to develop infrastructure to provide solutions for business applications
    • Architect products to use infrastructure platforms in a scalable, reliable manner
    • Debug reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
    • Make monitoring and alerting fire on symptoms rather than on outages
    • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
    • Have a desire to solve everyday challenges facing software engineers and automate their toil away
    • Have an excellent ability to manage multiple tasks and expectations at once
    • Participate in various projects intended to continually improve or upgrade the infrastructure
    • Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
    • Review and analyze high impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
    • Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third-party vendors
    • Design, code, test, debug, and document programs using Agile development practices
    • Make decisions in technical designs, implementation plans and identify project risks and resource requirements
    • Direct the daily risk and control flow of operations, focusing on policies, procedures, and work standards to ensure success
    • Recommend courses of action to maintain cost effectiveness and achieve results
    • Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals
    • Interact with customers and vendors
    • Lead small to medium cross-organizational transformational efforts in Platform space
    • Provide expertise in Kafka brokers, ZooKeeper, Kafka Connect, Schema Registry, KSQL, REST Proxy, and Kafka Control Center
    • Use automation and provisioning tools such as BladeLogic, Ansible, Chef, Jenkins, and GitLab
    • Deliver results in less defined & constantly changing environments
    • Communicate with a broad and diverse audience, including technology and business leaders; simplify complex messages for consumption
    • As an application support specialist, lead support functions and drive the execution and maturity of multiple application support services, including incident triage, root cause analysis, change evaluation, execution, and validation, deployment management, and risk and vulnerability management. Work closely with development and infrastructure partners such as middleware, NAS, database, and network teams
    • Partner with CIO technology partners and managers to influence and support innovation and the continued drive towards automation, with touchless operational sustainment as a design/architecture construct
    • Sustain operations and reduce risk in the ecosystem by aggressively pursuing safety and soundness actions, including but not limited to vulnerability remediation, patching, end-of-life management, and resiliency
    • Engage hands-on in all Production environment RunOps and DevOps support activities needed for the platform and applications
    • Drive operational management via Incident response, communication and tracking along with root cause identification and closure.
    • Manage and coordinate Production change requests and release management.
    • Provide operational continuity through the development, management, measurement, analysis, and reporting of key service-level metrics as required by management
    • Maintain a sustained focus on continuous service improvement and innovation to design, implement, and ensure SLAs, KPIs, and OLAs for the critical business processes, applications, and partner interfaces
    • Regularly present Production performance, incidents, root causes and preventative actions, and trend analysis to technical and business management teams.
    • Maintain and update all Production related documentation (e.g., game plans, run books, procedures, processes).
    • Ensure effective Production systems monitoring, alarming and notification response/maintenance.
    • Provide general oversight and direction to virtual teams.

 

Required Qualifications, US:

    • 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
    • 5+ years of experience troubleshooting environments across the entire architecture (i.e., applications to infrastructure)
    • 3+ years of hands-on Linux administration experience


Desired Qualifications:

    • 1+ years of experience in Artificial Intelligence, Natural Language Processing, Machine Learning, Distributed Computing, Chatbot, and Virtual Assistant
    • 1+ years of experience supporting and monitoring Apache Flink solutions for real-time data processing
    • 1+ years of experience supporting and monitoring service load-balancing architectures, including F5 and VMware AVI
    • 1+ years of experience with Big Data or Hadoop tools such as Spark, Hive, Kafka, and Map
    • Cloud Architect or Engineer Certification (e.g., GCP, Azure, AWS)
    • A BS/BA degree or higher in information technology
    • Competent working in one or more environments highly integrated with an operating system.
    • Have experience with VMware Pivotal Cloud Foundry (PCF) and Tanzu Application Service (TAS) technologies
    • Have experience with Docker, OpenShift Container Platform (OCP), Kubernetes, Terraform, or similar IaC technologies
    • Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
    • Experience implementing and administering/managing technical solutions in major, large-scale system implementations.
    • Strong critical thinking skills to evaluate alternatives and present solutions that are consistent with business objectives and strategy.
    • Ability to lead projects/initiatives with high risk and complexity
    • Ability to manage to production goals/SLAs/SLOs/KPIs, deadlines, and operational metrics
    • Ability to manage tasks independently and take ownership of responsibilities
    • Ability to learn from mistakes and apply constructive feedback to improve performance
    • Ability to adapt to a rapidly changing environment.
    • Proven leadership abilities including effective knowledge sharing, conflict resolution, facilitation of open discussions, fairness and displaying appropriate levels of assertiveness.
    • Ability to communicate highly complex technical information clearly and articulately for all levels and audiences.
    • Willingness to learn new technologies/tools and train your peers.
    • Ability to identify root-cause issues, articulate improvement opportunities, and design approaches/programs/products to improve overall quality assurance
    • Strong knowledge of monitoring tools & their application (Glassbox, AppDynamics, Splunk, BigPanda AIOps, etc.)
    • Understanding of system performance and how load drives utilization and customer experiences.
    • Experience with Business Continuity Planning and Disaster Recovery, Application Resiliency/Highly Available Architecture, Site Resiliency
    • Knowledge and understanding of Conversational Artificial Intelligence, Machine Learning, Deep Learning, Linear Regression, Models

Education

Bachelor's degree