If you're evaluating which accelerator to choose for your future workloads, beware that AMD's peer-to-peer bandwidth is 7x slower than all-to-all bandwidth. As long as you use full nodes or single GPUs, you have nothing to worry about. But if you have to deploy TP=2, TP=4, or
AMD Peer-to-Peer Bandwidth Performance Impact on Distributed Inference