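We trained the model with grouped-query attention (GQA) and a sliding window of 4096 tokens. Because each token attends to at most the most recent 4096 tokens, the key-value cache stays at a constant size and total decoding cost grows only linearly with sequence length. Our changes to FlashAttention v2 and xFormers to support sliding-window attention are available to the community.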
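To make the constant cache size concrete, here is a minimal sketch, assuming the cache is a rolling buffer in which the entry for position i is written to slot i mod W (W being the window size). The class name `RollingKVCache` and its methods are hypothetical and only illustrate the idea; they are not the FlashAttention or xFormers API.

```python
WINDOW = 4096  # sliding-window size stated in the text


class RollingKVCache:
    """Hypothetical fixed-size KV cache: position i lives in slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # memory is allocated once and never grows
        self.next_pos = 0             # absolute position of the next token

    def append(self, kv):
        # Overwrite the oldest entry once the buffer is full.
        self.slots[self.next_pos % self.window] = (self.next_pos, kv)
        self.next_pos += 1

    def visible_positions(self):
        # Positions still held in the cache, oldest first.
        start = max(0, self.next_pos - self.window)
        return list(range(start, self.next_pos))

    def __len__(self):
        # Number of filled slots, capped at the window size.
        return min(self.next_pos, self.window)


if __name__ == "__main__":
    cache = RollingKVCache(WINDOW)
    for pos in range(10_000):
        cache.append(f"kv{pos}")  # placeholder for real key/value tensors
    print(len(cache))                    # filled slots
    print(cache.visible_positions()[0])  # oldest position still cached
```

Under this assumption, each decoding step touches at most W cached entries no matter how long the sequence is, so the per-token cost is bounded and the total cost scales linearly with the number of generated tokens.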
GQA and Sliding Window Optimization for Efficient Decoding