Description

Minimum Qualifications

MLOps Experience: Demonstrated experience in operationalizing and maintaining machine learning models in production environments, including deployment, monitoring, and lifecycle management.

Data Pipeline Operations: Extensive experience maintaining and troubleshooting data pipelines built with tools like Apache Airflow, Prefect, cloud data services (AWS, Azure, GCP), and data processing frameworks (Spark, Kafka), ensuring reliable data flow for AI systems.

System Monitoring: Proficiency in monitoring AI system and data pipeline performance, detecting anomalies, and implementing proactive measures to ensure system reliability and availability.

Incident Management: Strong experience in troubleshooting, diagnosing, and resolving AI system and data infrastructure issues, with the ability to prioritize incidents based on business impact.

Performance Optimization: Knowledge of techniques to optimize AI system and data pipeline performance, including resource allocation, scaling strategies, and performance tuning.

Change Management: Experience implementing changes to production AI systems and data pipelines with minimal disruption, including testing, validation, and rollback procedures.

Data Quality Management: Understanding of data quality principles and their impact on AI system performance, with the ability to identify and address data-related issues in processing pipelines.

Documentation and Knowledge Management: Excellence in creating and maintaining operational documentation, runbooks, and knowledge articles for AI systems and data pipelines.

Automation Skills: Ability to create and implement automation scripts and workflows to streamline routine operational tasks for both AI systems and data flows, enhancing overall system reliability.

DevOps Practices: Familiarity with DevOps and CI/CD principles as applied to AI systems and data pipelines, including containerization, orchestration, and infrastructure as code.

Security Awareness: Understanding of security best practices for AI operations and data handling, including access control, data protection, and vulnerability management.

Collaboration Skills: Strong ability to work with cross-functional teams, communicate technical concepts clearly, and coordinate incident response activities effectively.

Problem-Solving: Excellent analytical and problem-solving skills, with the ability to troubleshoot complex issues in AI systems and data infrastructure methodically and efficiently.

Compliance Knowledge: Understanding of relevant regulations and compliance requirements affecting AI systems and data processing in higher education environments.

Communication Skills: Clear and concise communication abilities, both written and verbal, to document procedures, report incidents, and coordinate with stakeholders.

Service Management: Knowledge of IT service management principles and frameworks, with experience applying them to AI and data pipeline operations.


 

Bachelor’s degree in Computer Science, Information Technology, or a related field; technical certifications in relevant areas (e.g., cloud platforms, MLOps, data engineering) preferred.

Minimum of 3 years of experience in IT operations, with at least 1 year focused on AI/ML systems and data pipeline support.

Experience with cloud platforms (AWS, Azure, or GCP) and their AI/ML and data engineering service offerings.

 

Key Responsibilities & Accountabilities



 



 

System Monitoring and Incident Management

Monitor AI system and data pipeline health, performance, and availability using established monitoring tools and dashboards. Detect, triage, and resolve incidents affecting AI systems and their data infrastructure, coordinating with technical teams as needed. Implement proactive measures to prevent recurring issues and minimize service disruptions.


 

Operational Support and Maintenance

Perform routine operational tasks to maintain AI systems and data pipelines, including model updates, data refreshes, pipeline maintenance, and system patches. Implement scheduled maintenance activities with minimal service disruption. Manage user access and permissions for AI platforms according to security policies.


 

Performance Analysis and Optimization

Analyze AI system and data pipeline performance metrics, identify bottlenecks and inefficiencies, and implement optimizations to improve response times, data flow, accuracy, and resource utilization. Monitor for model drift and data quality issues, coordinating retraining or pipeline adjustments when necessary.


 

Documentation and Knowledge Management

Create and maintain comprehensive operational documentation, including runbooks, standard operating procedures, and knowledge base articles. Document system configurations, data pipeline dependencies, and recovery procedures to ensure operational continuity.


 

Continuous Improvement and Automation

Identify opportunities for process improvement and automation in AI operations. Develop and implement scripts and workflows to automate routine tasks, reducing manual effort and minimizing human error. Contribute to the evolution of MLOps practices based on operational experience and emerging best practices.

Education

Bachelor’s degree in Computer Science, Information Technology