(2/n) Increasing weight sparsity causes vanishing activation scales with both SP and μP, leading to poor training dynamics.
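
A quick way to see the forward-pass side of this is a minimal NumPy sketch. It is not the thread's experiment: it assumes a single linear layer with an SP-style N(0, 1/fan_in) init, a random unstructured static mask, and unit-variance Gaussian inputs. The helper name `activation_rms` and all sizes are made up for illustration.

```python
import numpy as np

def activation_rms(density, fan_in=1024, fan_out=1024, batch=256, seed=0):
    """RMS of the outputs of one randomly masked linear layer (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Dense init with variance 1/fan_in, so the dense layer preserves unit scale.
    w = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    # Assumption: unstructured sparsity, each weight kept with probability `density`.
    mask = rng.random((fan_out, fan_in)) < density
    # Unit-variance inputs, one column per example.
    x = rng.normal(size=(fan_in, batch))
    y = (w * mask) @ x
    return float(np.sqrt(np.mean(y ** 2)))

for density in (1.0, 0.5, 0.25, 0.0625):
    rms = activation_rms(density)
    print(f"density={density:<7} rms={rms:.3f} sqrt(density)={np.sqrt(density):.3f}")
```

Under these assumptions each output has variance equal to the density, so the activation RMS tracks sqrt(density), and the shrinkage compounds across layers unless the init is rescaled to account for sparsity.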