Job Description:

  • Architect and implement distributed training strategies using frameworks such as Horovod or DeepSpeed.
  • Deploy and manage ML models using containerization (Docker, Kubernetes) and serving frameworks (TensorFlow Serving, TorchServe, Seldon Core).
  • Implement robust model monitoring and drift detection systems.
  • Leverage MLOps best practices for CI/CD of ML pipelines and models.
  • Profile and optimize model performance for low-latency inference.
  • Integrate with various data storage solutions (e.g., distributed file systems, vector databases).
  • Contribute to the development of internal AI/ML infrastructure and tooling.
  • Troubleshoot and debug complex distributed AI/ML systems.


Key Skills:

  • Deep understanding of machine learning paradigms (supervised, unsupervised, deep, and reinforcement learning).
  • Expertise in Python and relevant scientific computing libraries (NumPy, SciPy).
  • Proficient in deep learning frameworks (TensorFlow, PyTorch) and their ecosystems.
  • Strong experience with data pipeline orchestration tools (Airflow, Kubeflow).
  • Expertise in feature engineering platforms (Feast, Tecton).
  • Solid understanding of distributed computing concepts and frameworks (Spark, Dask).
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Knowledge of ML model serving frameworks (TensorFlow Serving, TorchServe, Seldon Core).
  • Familiarity with model monitoring and drift detection techniques.
  • Strong understanding of data serialization and storage formats (e.g., Parquet, Avro, Protocol Buffers).



Education

Any Graduate