Yes, and it grows linearly with sequence length, so the pain shows up fast. This is why grouped-query attention (GQA), sliding-window attention, and quantized caches exist. They all attack the same problem from different angles: keep the cache small so decode stays fast.
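To put rough numbers on that, here is a back-of-the-envelope sketch of cache size for a single sequence. The function name `kv_cache_bytes` and the model shape used below (32 layers, 32 heads, head dimension 128, fp16) are hypothetical and picked only for illustration. It also ignores batch size and any scale/zero-point overhead a quantized cache would carry.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Approximate KV cache size in bytes for one sequence (hypothetical helper)."""
    # A sliding window caps how many tokens are kept; otherwise all are cached.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem


GIB = 1024 ** 3

# Assumed baseline: full multi-head attention, fp16 cache, 4096 tokens.
baseline = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)

# GQA: fewer key/value heads shared across the query heads.
gqa = kv_cache_bytes(4096, n_layers=32, n_kv_heads=8, head_dim=128)

# Sliding window: keep only the most recent 1024 tokens.
windowed = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128,
                          window=1024)

# Quantized cache: 1 byte per element instead of 2.
int8 = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128,
                      bytes_per_elem=1)

for name, size in [("baseline", baseline), ("gqa", gqa),
                   ("windowed", windowed), ("int8", int8)]:
    print(f"{name:9s} {size / GIB:.2f} GiB")
```

Under these assumptions the baseline comes to 2 GiB at 4096 tokens and doubles every time the sequence length doubles. Each of the three techniques shrinks a different factor in the same product: the number of KV heads, the number of cached tokens, or the bytes per element.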