@avshalomm solved it by exploiting the fact that the MoE block has no interaction between different tokens: each token is routed and processed on its own, so a long context can be iterated over in chunks. The fix has since been merged into vLLM.
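To make the token-independence argument concrete, here is a minimal NumPy sketch of a toy top-k MoE layer. It is not vLLM's implementation: `moe_block`, `moe_block_chunked`, `chunk_size`, and the weight shapes are hypothetical names and choices made only for illustration. Because each output row depends only on the matching input row, running the layer chunk by chunk should give the same result as running it on the whole sequence, while bounding the size of the intermediate buffers.

```python
import numpy as np


def moe_block(x, gate_w, expert_w, top_k=2):
    """Toy top-k MoE layer (illustrative only, not vLLM code).

    x:        (tokens, d) input activations
    gate_w:   (d, experts) router weights
    expert_w: (experts, d, d) one linear map per expert
    """
    logits = x @ gate_w                                  # per-token router scores
    top = np.argsort(-logits, axis=-1)[:, :top_k]        # experts chosen for each token
    top_logits = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the chosen experts

    out = np.zeros_like(x)
    for e in range(expert_w.shape[0]):
        token_idx, slot = np.nonzero(top == e)           # tokens routed to expert e
        if token_idx.size == 0:
            continue
        scale = weights[token_idx, slot][:, None]
        out[token_idx] += scale * (x[token_idx] @ expert_w[e])
    return out


def moe_block_chunked(x, gate_w, expert_w, chunk_size, top_k=2):
    """Run the same layer over the sequence in fixed-size token chunks."""
    pieces = [
        moe_block(x[start:start + chunk_size], gate_w, expert_w, top_k)
        for start in range(0, x.shape[0], chunk_size)
    ]
    return np.concatenate(pieces, axis=0)


# Small check with random data: full and chunked passes should agree.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))
gate_w = rng.standard_normal((8, 4))
expert_w = rng.standard_normal((4, 8, 8))

full = moe_block(x, gate_w, expert_w)
chunked = moe_block_chunked(x, gate_w, expert_w, chunk_size=3)
assert np.allclose(full, chunked)
```

The same trick would not work for attention, where tokens do interact; it applies here only because the router and the experts act row by row.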
MoE Token Independence Enables Efficient Long Context Processing in vLLM