Small correction: prefill is compute-bound, decode is the memory-bandwidth-bound phase. But the KV cache point is spot on. It grows linearly with context, and every decode step has to read the whole thing.
By
–
Small correction: prefill is compute-bound, decode is the memory-bandwidth-bound phase. But the KV cache point is spot on. It grows linearly with context, and every decode step has to read the whole thing.