No, it's very small. Since we're looping over each weight matrix individually, the only way memory usage would be non-negligible is if you used a large number of devices, say 256 GPUs. In that case, don't use my code 🙂
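To put rough numbers on that, here is a back-of-the-envelope sketch. It is not the code being discussed; it assumes (hypothetically) that the extra memory is a temporary buffer holding one matrix-sized slot per device while a single weight matrix is processed. The function name, the 4096 x 4096 shape, and the 2-byte (bf16) element size are all made up for illustration.

```python
# Hypothetical estimate: assumes the per-step overhead is one
# matrix-sized slot per device, for a single weight matrix at a time.
def gather_buffer_bytes(num_devices, rows, cols, bytes_per_element=2):
    """Temporary buffer size in bytes under the assumption above."""
    return num_devices * rows * cols * bytes_per_element

MIB = 1024 ** 2

# Compare a small node with a large cluster for one 4096 x 4096 bf16 matrix.
for num_devices in (8, 256):
    size_mib = gather_buffer_bytes(num_devices, 4096, 4096) / MIB
    print(f"{num_devices} devices: {size_mib:.0f} MiB")
```

Under these assumptions the buffer is 256 MiB on 8 devices and 8 GiB on 256 devices, which is the kind of growth that makes the overhead matter only at large device counts.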