Ah well, it's certainly done way better than the others at finding something relevant! However, it looks like this article might be mistaken in saying the 4 attention projections are the same. IIRC, the GQA optimization only applies to K and V, which we see here:
GQA Optimization: K and V Attention Projections Explained
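To make the K/V point concrete, here's a rough sketch of the projection weight shapes under GQA as I understand it. The function name and the example sizes (4096-wide model, 32 query heads, 8 KV heads) are just illustrative assumptions on my part, not taken from the article:

```python
def gqa_projection_shapes(d_model, n_heads, n_kv_heads):
    """Return (in_features, out_features) for each attention projection.

    Hypothetical helper for illustration only. Assumes the usual GQA
    layout: Q and the output projection keep all n_heads, while K and V
    are shrunk to n_kv_heads shared heads.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")

    head_dim = d_model // n_heads
    return {
        "q": (d_model, n_heads * head_dim),     # full width
        "k": (d_model, n_kv_heads * head_dim),  # reduced by GQA
        "v": (d_model, n_kv_heads * head_dim),  # reduced by GQA
        "o": (n_heads * head_dim, d_model),     # full width
    }


# Illustrative sizes (assumed, not from the article).
shapes = gqa_projection_shapes(d_model=4096, n_heads=32, n_kv_heads=8)
for name, shape in shapes.items():
    print(name, shape)
```

So only K and V get smaller; Q and the output projection keep their full size. With `n_kv_heads == n_heads` this collapses back to ordinary multi-head attention, where all four really would have the same shape.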