Description

Responsibilities -

· Design, implement, and maintain distributed systems to build world-class ML platforms and products at scale

· Diagnose and fix complex issues across the entire stack, and automate their remediation, to ensure maximum uptime and performance

· Design and extend services to improve functionality and reliability of the platform

· Monitor system performance, optimize for cost and efficiency, and resolve any issues that arise

· Build relationships with stakeholders across the organization to better understand internal customer needs and improve our product for end users

 

Required Skills -

· 3+ years of experience in distributed systems with deep knowledge of computer science fundamentals

· Experience with containerization and orchestration technologies, such as Docker and Kubernetes

· Experience in delivering data and machine learning infrastructure in production environments

· Experience configuring, deploying, and troubleshooting large-scale production environments

· Experience in designing, building, and maintaining scalable, highly available systems that prioritize ease of use

· Experience with alerting, monitoring, and remediation automation in a large-scale distributed environment

· Extensive programming experience in Java, Python, or Go

· Strong collaboration and communication (verbal and written) skills

· B.S., M.S., or Ph.D. in Computer Science, Computer Engineering, or equivalent practical experience

 

Preferred Skills -

· Understanding of the ML lifecycle and state-of-the-art ML infrastructure technologies

· Experience with GPUs and other types of HPC infrastructure

· Experience with training frameworks such as PyTorch, TensorFlow, and JAX

· Deep understanding of Ray and KubeRay

· Experience with ML training/inference profiling and optimization

Education

Any Graduate