I saw a paper mentioned on Twitter last week about the different training dynamics of multiple stacked linear layers versus a single linear layer: even though the two are mathematically equivalent, the training behaves differently. I can't find it now though 🙁
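This is not from the paper in question, just a minimal NumPy sketch of why the two setups can train differently even though they represent the same function. It assumes a mean squared error loss, plain gradient descent, and random data; the function name `one_step_comparison`, the shapes, and the learning rate are all made up for illustration. Both models start from the same effective matrix (`W = W2 @ W1`), take one gradient step, and are then compared.

```python
import numpy as np


def one_step_comparison(seed=0, lr=0.1):
    """Take one gradient-descent step on a single linear layer and on a
    two-layer linear stack that starts out computing the same function."""
    rng = np.random.default_rng(seed)
    d_in, d_hidden, d_out, n = 4, 4, 3, 32  # illustrative sizes (assumption)

    X = rng.normal(size=(d_in, n))   # inputs, one column per sample
    Y = rng.normal(size=(d_out, n))  # targets

    # Two-layer parameters, and the single matrix they are equivalent to.
    W1 = rng.normal(size=(d_hidden, d_in))
    W2 = rng.normal(size=(d_out, d_hidden))
    W = W2 @ W1

    # Gradient of L = ||W X - Y||_F^2 / (2n) with respect to the product W.
    G = (W @ X - Y) @ X.T / n

    # Single layer: ordinary gradient step on W.
    W_single = W - lr * G

    # Stacked layers: chain rule gives dL/dW1 = W2^T G and dL/dW2 = G W1^T.
    W1_new = W1 - lr * (W2.T @ G)
    W2_new = W2 - lr * (G @ W1.T)
    W_stacked = W2_new @ W1_new

    # Expanding the product above shows the stacked update is a
    # preconditioned version of G plus a second-order term in lr.
    W_expanded = (
        W
        - lr * (G @ W1.T @ W1 + W2 @ W2.T @ G)
        + lr**2 * (G @ W1.T @ W2.T @ G)
    )

    return {
        "W_init": W,
        "W_single": W_single,
        "W_stacked": W_stacked,
        "W_expanded": W_expanded,
    }


if __name__ == "__main__":
    out = one_step_comparison()
    diff = np.linalg.norm(out["W_single"] - out["W_stacked"])
    print("difference between effective matrices after one step:", diff)
```

The point of the sketch: the stacked model's effective update is `G @ W1.T @ W1 + W2 @ W2.T @ G` (plus a small `lr**2` term) rather than `G`, so the current weights rescale and rotate the gradient even though the function being represented is identical at the start.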
Training Dynamics: Multiple vs Single Linear Layers