Responsibilities
- Build and maintain efficient, scalable, distributed training systems for large-scale AI models, including data preprocessing, training orchestration, and model evaluation.
- Optimize training procedures to improve performance and resource utilization while maintaining scalability and reliability.
- Collaborate with researchers to create training and evaluation pipelines for state-of-the-art algorithms.
- Design and develop benchmarks for evaluating ML models.
- Train and fine-tune foundation models for robotic applications.
- Monitor and analyze pipelines, identifying bottlenecks and proposing solutions to improve efficiency and performance.
- Ensure the robustness and reliability of the training infrastructure, including automated testing and continuous integration.