I am not sure the claim of lower memory demands will actually hold.
There are so many bottlenecks and opportunities for improvement. If better quantization, such as TurboQuant, shrinks the KV cache, that just means we will spend the freed memory capacity elsewhere:
– bigger
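
To make the "we will use it elsewhere" point concrete, here is a rough back-of-the-envelope sketch. The model shape, the 16 GiB budget, and the 4-bit width are all hypothetical values picked for illustration. They are not figures from TurboQuant or from any specific model, and the formula ignores quantization metadata such as scales.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # 2x for keys and values; dividing by 8 converts bits to bytes.
    return 2 * tokens * layers * kv_heads * head_dim * bits // 8


def tokens_that_fit(budget_bytes, layers, kv_heads, head_dim, bits):
    """Largest token count whose KV cache fits in `budget_bytes`."""
    per_token_bits = 2 * layers * kv_heads * head_dim * bits
    return (budget_bytes * 8) // per_token_bits


# Hypothetical model shape and memory budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 16 * 1024**3  # 16 GiB set aside for the KV cache

for bits in (16, 4):
    n = tokens_that_fit(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{bits}-bit cache: {n} tokens fit in the same budget")
```

Under these assumptions the same budget holds four times as many cached tokens at 4 bits as at 16 bits, so the saving is more likely to show up as longer contexts or larger batches than as a smaller memory footprint.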
Memory optimization bottlenecks and quantization trade-offs in LLMs