After the single box ran, we scaled it to 2 and 16 systems with linear scaling. There were no model & code changes required.
Megatron
DeepSpeed
Sharding The model lives on a single block of memory. No need to break it up. The whole thing trains like one giant GPU.
Linear Scaling AI Model Training Across Multiple Systems
By
–
