I'm not sure whether there is any theory behind it, or whether it's just an empirical observation that it works well. One intuition might be that the Q, K, V matrices work well for language in general, so you don't want to mess them up, whereas the other ones are more like the extraction params.
Theory Behind Q, K, V Matrices in Language Models