AI Dynamics

Global AI News Aggregator

GQA and Sliding Window Optimization for Efficient Decoding

We trained it with GQA and a sliding window of 4096 tokens, resulting in constant cache size and a linear decoding speed. Our changes to FlashAttention v2 and xFormers to support sliding window are available to the community.

→ View original post on X — @guillaumelample,

Commentaires

Leave a Reply

Your email address will not be published. Required fields are marked *