MIT HAN Lab has introduced a framework that applies a full KV cache only to retrieval heads and uses a lightweight, constant-length KV cache for streaming heads. This reduces memory use and latency in both the decoding and pre-filling stages of an LLM without compromising its long-context abilities.
MIT KV Cache Framework Reduces LLM Decoding Latency
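To make the split between the two kinds of heads concrete, here is a minimal Python sketch of a per-head cache policy. It is not the lab's implementation. The class `HeadKVCache`, the function `simulate`, and the parameters `num_sink` and `num_recent` are hypothetical names, and the choice to build the constant-length cache from a few initial tokens plus a sliding window of recent tokens is an assumption made for illustration. Token indices stand in for real key/value tensors.

```python
from collections import deque


class HeadKVCache:
    """Illustrative KV cache for a single attention head (hypothetical design)."""

    def __init__(self, is_retrieval_head, num_sink=4, num_recent=8):
        self.is_retrieval_head = is_retrieval_head
        self.num_sink = num_sink
        self.full = []  # retrieval heads: every token is kept
        self.sink = []  # streaming heads: first few tokens (assumption)
        self.recent = deque(maxlen=num_recent)  # streaming heads: sliding window

    def append(self, kv):
        """Store one token's key/value entry according to the head type."""
        if self.is_retrieval_head:
            self.full.append(kv)
        elif len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            # deque with maxlen silently drops the oldest entry when full
            self.recent.append(kv)

    def entries(self):
        """Return the entries this head can attend to, in token order."""
        if self.is_retrieval_head:
            return list(self.full)
        return self.sink + list(self.recent)

    def __len__(self):
        return len(self.entries())


def simulate(num_tokens, retrieval_flags, num_sink=4, num_recent=8):
    """Feed num_tokens tokens to one cache per head and return the caches."""
    caches = [HeadKVCache(flag, num_sink, num_recent) for flag in retrieval_flags]
    for token_index in range(num_tokens):
        for cache in caches:
            cache.append(token_index)  # index stands in for a (key, value) pair
    return caches


if __name__ == "__main__":
    # One retrieval head and one streaming head over a 100-token context.
    for cache in simulate(100, [True, False]):
        kind = "retrieval" if cache.is_retrieval_head else "streaming"
        print(kind, len(cache))
```

In this sketch the retrieval head's cache grows linearly with the context, while the streaming head's cache stays capped at `num_sink + num_recent` entries however long the input becomes. That cap is where the memory saving described above would come from.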