There are many LLM training tricks, but which ones should you use? We quantify the effects of techniques such as SwiGLU and ALiBi, and show that combining them can reach the same loss with 2.86x less pretraining compute or 1.74x fewer parameters.
LLM Training Techniques: 2.86x Compute Efficiency Gains
