Key Skills: Machine Learning (ML), Artificial Intelligence (AI), TensorFlow, Python, PyTorch.
Roles and Responsibilities:
- Design, build, and rigorously optimize the complete stack for large-scale model training, fine-tuning, and inference (including data loading, distributed training, and model deployment) to maximize Model FLOPs Utilization (MFU) on compute clusters.
- Collaborate closely with research scientists to translate state-of-the-art models and algorithms into production-grade, high-performance code and scalable infrastructure.
- Implement, integrate, and test advancements from recent research publications and open-source contributions into enterprise-grade systems.
- Profile training workflows to identify and resolve bottlenecks across every layer of the stack, from input pipelines to inference, improving speed and resource efficiency.
- Contribute to the evaluation and selection of hardware, software, and cloud platforms that will define the future of the AI infrastructure stack.
- Use MLOps tools (e.g., MLflow, Weights & Biases) to establish best practices across the entire AI model lifecycle, including development, validation, deployment, and monitoring.
- Maintain extensive documentation of infrastructure architecture, pipelines, and training processes to ensure reproducibility and smooth knowledge transfer.
- Continuously research and implement improvements in large-scale training strategies and data engineering workflows to keep the organization at the cutting edge.
- Demonstrate initiative and ownership in developing rapid prototypes and production-scale systems for AI applications in the energy sector.
Experience Requirement:
- 5-9 years of experience building and optimizing large-scale machine learning infrastructure, including distributed training and data pipelines.
- Proven hands-on expertise with deep learning frameworks such as PyTorch, JAX, or PyTorch Lightning in multi-node GPU environments.
- Experience in scaling models trained on large datasets across distributed computing systems.
- Familiarity with writing and optimizing CUDA, Triton, or CUTLASS kernels for performance enhancement is preferred.
- Hands-on experience with AI/ML lifecycle management using MLOps frameworks and performance profiling tools.
- Demonstrated collaboration with AI researchers and data scientists to integrate models into production environments.
- Track record of open-source contributions in AI infrastructure or data engineering is a significant plus.
Education: M.E., B.Tech-M.Tech (Dual), BCA, B.E., B.Tech, M.Tech, MCA