We can optimize this by transferring all elements to one GPU, computing the final value, and sending it back to all other GPUs. This is a significant improvement. But now a single GPU must receive, compute, and communicate back the gradients, so this does not scale. pic.twitter.com/xgZJ5Fdymf
— Akshay 🚀 (@akshay_pachaar) August 17, 2025
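To make the bottleneck concrete, here is a minimal sketch of the gather-to-one-GPU scheme, simulated with plain Python lists rather than real devices. The function name `reduce_then_broadcast`, the `root` parameter, and the traffic counter are illustrative assumptions, not part of any library. It sums every worker's gradient on the root, copies the result back to everyone, and counts how many values pass through the root.

```python
def reduce_then_broadcast(grads, root=0):
    """Simulate summing gradients on one root worker and sending the sum back.

    grads: one gradient vector (list of floats) per simulated GPU.
    Returns the per-worker results and the number of values the root
    had to receive plus send (a hypothetical traffic measure).
    """
    n = len(grads)

    # Step 1: the root receives every other worker's gradient and adds it in.
    total = list(grads[root])
    received = 0
    for rank, g in enumerate(grads):
        if rank == root:
            continue
        received += len(g)
        total = [t + x for t, x in zip(total, g)]

    # Step 2: the root sends the summed gradient back to every other worker.
    sent = 0
    result = []
    for rank in range(n):
        if rank != root:
            sent += len(total)
        result.append(list(total))

    return result, received + sent


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced, root_traffic = reduce_then_broadcast(grads)
print(synced[0])      # [16.0, 20.0]
print(root_traffic)   # 12
```

In this sketch, with N workers and D gradient values each, the root handles 2 × (N − 1) × D values, so its load grows linearly with the number of GPUs. That is the scaling problem described above.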
