Lead Infrastructure Engineer

VDart
Charlotte, NC, USA

Description

- Lead complex initiatives to develop infrastructure to provide solutions for business applications
- Architect products to effectively utilize infrastructure platforms in a scalable, reliable manner
- Debug reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
- Make monitoring and alerting fire on symptoms rather than on outages
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
- Have a desire to solve everyday challenges facing software engineers and automate their toil away
- Have an excellent ability to manage multiple tasks and expectations at once
- Participate in various projects intended to continually improve or upgrade the infrastructure
- Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
- Review and analyze high impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
- Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third-party vendors
- Design, code, test, debug, and document programs using Agile development practices
- Make decisions in technical designs, implementation plans and identify project risks and resource requirements
- Direct the daily risk and control flow of operations, focusing on policies, procedures, and work standards to ensure success
- Recommend courses of action to maintain cost effectiveness and achieve results
- Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals
- Interact with customers and vendors
- Lead small to medium cross-organizational transformational efforts in the Platform space
- Provide expertise in Kafka brokers, ZooKeeper, Kafka Connect, Schema Registry, KSQL, REST Proxy, and Kafka Control Center
- Use automation and provisioning tools such as BladeLogic, Ansible, Chef, Jenkins, and GitLab
- Deliver results in less defined & constantly changing environments
- Communicate with a broad and diverse audience, including technology and business leaders, and simplify complex messages for consumption
- As an application support specialist, lead support functions and drive the execution and maturity of multiple application support services, including incident triage, root cause analysis, change evaluation-execution-validation, deployment management, and risk and vulnerability management. Work closely with development and infrastructure partners such as middleware, NAS, database, and network teams
- Partner with CIO technology partners and managers to influence and support innovation and a continued drive toward automation and touchless operational sustainment as a design/architecture construct
- Sustain operations and reduce risk in the ecosystem by aggressively pursuing safety-and-soundness actions, including but not limited to vulnerability remediation, patching, end-of-life management, and resiliency
- Engage hands-on in all Production environment RunOps and DevOps support activities needed for the platform and applications
- Drive operational management via Incident response, communication and tracking along with root cause identification and closure.
- Manage and coordinate Production change requests and release management.
- Provide operational continuity through the development, management, measurement, analysis, and reporting of key service-level metrics as required by management
- Maintain a sustained focus on driving continuous service improvements and innovation to design, implement, and ensure SLAs, KPIs, and OLAs for the critical business processes, applications, and partner interfaces
- Regularly present Production performance, incidents, root causes, preventative actions, and trend analysis to technical and business management teams.
- Maintain and update all Production related documentation (e.g., game plans, run books, procedures, processes).
- Ensure effective Production systems monitoring, alarming and notification response/maintenance.
- Provide general oversight and direction to virtual teams.

Required Qualifications, US:

- 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 5+ years of experience troubleshooting environments across the entire architecture (i.e., applications to infrastructure)
- 3+ years of hands-on Linux administration experience

Desired Qualifications:

- 1+ years of experience in Artificial Intelligence, Natural Language Processing, Machine Learning, Distributed Computing, Chatbot, and Virtual Assistant
- 1+ years of experience supporting and monitoring Apache Flink solutions for real-time data processing
- 1+ years of experience supporting and monitoring service load balancing architectures, including F5 and VMware AVI
- 1+ years of experience with Big Data or Hadoop tools such as Spark, Hive, Kafka, and Map
- Cloud Architect or Engineer Certification (e.g., GCP, Azure, AWS)
- A BS/BA degree or higher in information technology
- Competent working in one or more environments highly integrated with an operating system.
- Have experience with VMware Pivotal Cloud Foundry (PCF) and Tanzu Application Service (TAS) technologies
- Have experience with Docker, OpenShift Container Platform (OCP), Kubernetes, Terraform, or similar IaC technologies
- Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
- Experience implementing and administering/managing technical solutions in major, large-scale system implementations.
- High critical thinking skills to evaluate alternatives and present solutions that are consistent with business objectives and strategy.
- Ability to lead projects/initiatives with high risk and complexity
- Ability to manage to production goals/SLAs/SLOs/KPIs, deadlines, and operational metrics
- Ability to manage tasks independently and take ownership of responsibilities
- Ability to learn from mistakes and apply constructive feedback to improve performance
- Ability to adapt to a rapidly changing environment.
- Proven leadership abilities including effective knowledge sharing, conflict resolution, facilitation of open discussions, fairness and displaying appropriate levels of assertiveness.
- Ability to communicate highly complex technical information clearly and articulately for all levels and audiences.
- Willingness to learn new technologies/tools and train your peers.
- Ability to identify root-cause issues, articulate improvement opportunities, and design approaches/programs/products to improve overall quality assurance
- Strong knowledge of monitoring tools & their application (Glassbox, AppDynamics, Splunk, BigPanda AIOps, etc.)
- Understanding of system performance and how load drives utilization and customer experiences.
- Experience with Business Continuity Planning and Disaster Recovery, Application Resiliency/Highly Available Architecture, Site Resiliency
- Knowledge and understanding of Conversational Artificial Intelligence, Machine Learning, Deep Learning, Linear Regression, Models

Key Skills

GCP, Azure, AWS, Spark, Hive, Kafka, HTML, Kubernetes, Terraform, MongoDB

Education

Bachelor's degree

Posted On: 30+ Days Ago
Experience: 5+ years of experience
Openings: 1
Category: Sr. Infrastructure Engineer
Tenure: Contract - Corp-to-Corp Position