In their Olmo 2 report they had an ablation study showing it reduces loss spikes during training (but that ablation also included QK norm, so it's hard to say how much of the reduction is due to QK norm and how much to their post-norm flavor).
Maybe the best of both worlds is to do both pre-norm and post-norm, like Gemma.
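To make the three placements concrete, here is a minimal pure-Python sketch. It assumes, based on my reading of the reports, that the Olmo 2 flavor normalizes the sublayer output inside the residual branch and that Gemma normalizes both the input and the output of each sublayer. The function names and `toy_sublayer` are hypothetical, the RMSNorm has no learned scale, and QK norm is left out entirely.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm without a learned scale (omitted to keep the sketch short)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise residual addition."""
    return [u + v for u, v in zip(a, b)]


def pre_norm_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Classic pre-norm: normalize the input, leave the branch output unbounded.
    return add(x, sublayer(rms_norm(x)))


def post_norm_in_branch_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Assumed Olmo 2 flavor: normalize the sublayer output before the residual add.
    return add(x, rms_norm(sublayer(x)))


def pre_and_post_norm_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Assumed Gemma-style placement: normalize both the input and the output.
    return add(x, rms_norm(sublayer(rms_norm(x))))


def toy_sublayer(x: Vector) -> Vector:
    # Hypothetical stand-in for attention or an MLP that produces large activations.
    return [100.0 * v for v in x]


def branch_rms(x: Vector, out: Vector) -> float:
    """RMS of what the block added to the residual stream."""
    diff = [o - v for o, v in zip(out, x)]
    return math.sqrt(sum(d * d for d in diff) / len(diff))
```

With a sublayer that blows up its activations, the pre-norm block adds a branch with RMS around 100 to the residual stream, while the other two keep it around 1. That bounded update is the intuition usually given for why normalizing the branch output could help with loss spikes, though this toy says nothing about how much of the reported effect comes from QK norm.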
Olmo 2 Ablation Study on Loss Spikes and Normalization Techniques