Position Overview: We are seeking a highly skilled and experienced Technical Lead – AI Observability to lead the development of a cutting-edge AI Observability platform at IBM. This role is ideal for someone with deep expertise in artificial intelligence (AI), machine learning (ML), computer vision (CV), and large language models (LLMs). As the Technical Lead, you will play a key role in shaping the vision, architecture, and execution of a centralized platform that will enable real-time monitoring, debugging, and optimization of AI-driven systems.
In this leadership position, you will guide a team of engineers, collaborate with cross-functional teams, and work closely with both AI and DevOps experts to ensure that the platform delivers robust observability across multiple domains, including computer vision, machine learning, and natural language processing (NLP) models. You will also be responsible for establishing best practices and technical standards for observability, monitoring, and diagnostics within the AI ecosystem.
Key Responsibilities:
Platform Development & Architecture:
- Lead the design, development, and deployment of a centralized AI Observability platform focused on end-to-end monitoring and diagnostics for AI models, specifically in the realms of computer vision (CV), machine learning (ML), and large language models (LLMs).
- Define and implement an observability architecture that allows for seamless integration across various AI systems, ensuring efficient data flow and aggregation.
- Build a scalable and reliable platform capable of handling large-scale AI/ML workloads, with a focus on performance, security, and compliance.
AI & ML Model Monitoring:
- Collaborate with data scientists and AI engineers to build metrics, monitoring dashboards, and alerts that provide deep insights into the performance, health, and behavior of AI models in production.
- Design tools to diagnose issues such as model drift, bias, errors in predictions, and other anomalies to ensure optimal model performance.
- Lead the development of mechanisms to continuously evaluate and improve model accuracy, robustness, and fairness across ML, CV, and LLM systems.
Thought Leadership & Strategic Guidance:
- Provide technical thought leadership to guide and influence the strategic direction of AI observability initiatives at IBM.
- Drive the adoption of best practices for AI observability and monitoring, ensuring they align with industry standards, security protocols, and business goals.
- Stay ahead of industry trends, innovations, and challenges related to AI/ML observability, ensuring the platform remains state-of-the-art and can scale with emerging technologies.
Team Leadership & Collaboration:
- Lead a cross-functional team of engineers and AI experts in the development and maintenance of the observability platform.
- Act as a mentor and coach to junior engineers, fostering a culture of continuous learning, innovation, and excellence.
- Collaborate with product managers, AI researchers, and data scientists to understand the technical requirements and operational needs for the AI models.
- Liaise with internal and external stakeholders to ensure platform development aligns with business goals, customer needs, and operational requirements.
AI Observability Best Practices & Documentation:
- Establish observability best practices, standards, and guidelines for AI/ML model monitoring, logging, and alerting.
- Develop and maintain comprehensive documentation that provides clarity on system architecture, tools, and procedures for AI observability.
- Ensure compliance with legal, regulatory, and ethical standards, including data privacy and AI model fairness guidelines.
Performance Optimization & Troubleshooting:
- Identify performance bottlenecks and areas for optimization within the observability platform and the underlying AI models.
- Implement strategies to address scalability issues and improve the performance of observability tools as model complexity increases.
- Troubleshoot production issues, investigate root causes, and provide solutions to mitigate downtime and improve system stability.
Qualifications & Skills:
Experience:
- 8 to 10 years of experience in AI/ML engineering, with a strong focus on AI observability, monitoring, and diagnostics.
- Proven track record in leading AI/ML development projects, particularly in large-scale, production-grade AI systems.
- In-depth knowledge of computer vision, machine learning, and large language models (LLMs).
- Experience with building or managing observability platforms for AI-driven systems.
- Strong background in cloud infrastructure, DevOps, and continuous integration/continuous deployment (CI/CD) pipelines.
Technical Skills:
- Expertise in AI/ML frameworks such as TensorFlow, PyTorch, or similar.
- Deep understanding of observability tools such as Prometheus, Grafana, OpenTelemetry, or similar monitoring/alerting frameworks.
- Strong coding skills in Python, Java, or similar languages commonly used in AI/ML.
- Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Familiarity with log aggregation and analysis platforms (e.g., ELK Stack, Splunk).
- Expertise in AI model interpretability, transparency, and ethical AI considerations.
Leadership & Soft Skills:
- Excellent problem-solving, communication, and leadership skills.
- Proven ability to manage, guide, and mentor technical teams.
- Strong interpersonal skills and the ability to work collaboratively with diverse teams across functions.
- Ability to translate complex technical concepts into business-friendly language for executives and non-technical stakeholders.
Education:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).