NEW Easily run a single training job across n nodes​
Getting multiple GPUs to coordinate gradient sharing and parameter synchronization across nodes is a hassle. Automatically handle all the Kubernetes networking, container orchestration, inter-node communication, checkpointing, and failover.