DSA: Learning-Based Token Selection Beyond Fixed Window Attention
Also, DSA is basically a smarter version of sliding-window attention: instead of forcing each query to attend only to a fixed window of recent tokens, the model learns which past tokens to select.
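
To make the contrast concrete, here is a minimal sketch in plain Python. It is not DSA's actual implementation: the function names, the `index_scores` list, and the score values are all made up for illustration, and the "learned" part is stood in for by a hand-written list of scores. It only shows how the two selection rules differ when both get the same budget of keys.

```python
def sliding_window_indices(t, window):
    """Positions query t may attend to under a fixed causal window."""
    start = max(0, t - window + 1)
    return list(range(start, t + 1))


def learned_topk_indices(t, index_scores, k):
    """Positions query t may attend to when a scorer picks the top-k past tokens.

    index_scores[s] is a hypothetical relevance score for position s.
    In a real model it would come from a learned scoring module; here it
    is just a list of numbers (an assumption for illustration).
    """
    candidates = range(t + 1)  # causal: only positions <= t
    ranked = sorted(candidates, key=lambda s: index_scores[s], reverse=True)
    return sorted(ranked[:k])  # return selected positions in ascending order


# Toy example: query at position 7, budget of 3 keys either way.
scores = [0.9, 0.1, 0.8, 0.2, 0.1, 0.3, 0.2, 0.5]  # made-up scores

window_sel = sliding_window_indices(7, 3)
topk_sel = learned_topk_indices(7, scores, 3)

print(window_sel)  # [5, 6, 7]
print(topk_sel)    # [0, 2, 7]
```

With the same budget, the window can only ever see the most recent positions, while the score-based selection can reach back to positions 0 and 2 if they score highly.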