Very early in the project I did a small run with and without QK norm and found that it helped. The same goes for embedding weight sharing. I'll retry! I'm not tied to any details of the model, and none of them were chosen more carefully than with a single run; I spent most of the time just
QK Norm and Embedding Weight Sharing in Model Optimization
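
For readers unfamiliar with the two techniques named here, the following is a minimal sketch in plain Python, not the author's model code. It assumes the L2-normalization flavor of QK norm (other variants apply RMSNorm or LayerNorm to queries and keys instead), and it treats embedding weight sharing as reusing one table for both the input lookup and the output projection. The names `l2_normalize`, `qk_norm_logits`, `TiedEmbedding`, and the `scale` parameter are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def qk_norm_logits(queries, keys, scale=1.0):
    """Attention logits computed from L2-normalized queries and keys.

    Assumption: `scale` stands in for the learnable gain or temperature that
    QK-norm variants typically use in place of 1/sqrt(d).
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    # After normalization each logit is a scaled cosine similarity,
    # so its magnitude is bounded by `scale` regardless of input size.
    return [
        [scale * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        for q in q_hat
    ]


class TiedEmbedding:
    """One weight table shared by the input embedding and the output projection."""

    def __init__(self, table):
        self.table = table  # shape: vocab_size x d_model

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.table[token_id]

    def output_logits(self, hidden):
        # Output side: dot the hidden state with every row of the same table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


# Usage: the large-magnitude key gives the same logit as a unit-length one would.
logits = qk_norm_logits([[3.0, 4.0]], [[30.0, 40.0], [-4.0, 3.0]])
emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vocab_logits = emb.output_logits([2.0, 3.0])
```

The point of the sketch is only to show what each switch changes: QK norm bounds the attention logits, and weight sharing removes the separate output matrix. Whether either helps at a given scale is an empirical question that a with/without comparison like the one described above has to answer.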