Single Node Training vs Large-Scale Multi-GPU Distributed Runs

But those were also much, much bigger runs, so it's a lot more impressive. This was on a single node, so you don't need to deal with any cross-node interconnect. It starts to get a lot more fun when you have to keep track of O(10,000) GPUs all at once. For a very specific
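The code-level gap between the two setups is actually smaller than the operational one. As a rough illustration (my sketch, not from the post itself, and it assumes PyTorch with the NCCL backend launched via torchrun), the same process-group init covers both cases; what changes is how the communication physically routes:

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and
    # MASTER_PORT for every worker process, on one node or many.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all-reduce: within one node NCCL routes this over
    # NVLink/PCIe; across nodes it has to traverse the cluster
    # interconnect, which is where the pain at O(10,000) GPUs lives.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # x now holds the world size on every rank

    if dist.get_rank() == 0:
        print(f"world size: {int(x.item())}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The launch command is where the two cases actually diverge: on one node you can run `torchrun --standalone --nproc_per_node=8 train.py`, while a multi-node run adds `--nnodes` and an `--rdzv_endpoint` pointing at a rendezvous host that every node can reach over the interconnect.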