Attention Mechanism Ratios in Language Models: Olmo vs Gemma
Do you remember whether they used it in a 3:1 ratio (like Olmo 3), a 5:1 ratio (like Gemma 3), or just pure sliding window attention?
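
To make the three options concrete, here is a rough sketch of how such ratios are usually read as a per-layer pattern: N sliding-window layers followed by one full-attention layer, repeated through the stack. The function name `layer_pattern`, the layer counts, and the assumption that the full-attention layer comes last in each group are all illustrative, not taken from any specific model's code.

```python
from typing import List, Optional


def layer_pattern(num_layers: int, local_per_global: Optional[int]) -> List[str]:
    """Return the attention type of each layer (hypothetical helper).

    local_per_global=3    -> 3 sliding-window layers per full-attention layer (3:1)
    local_per_global=5    -> 5 sliding-window layers per full-attention layer (5:1)
    local_per_global=None -> pure sliding-window attention
    """
    if local_per_global is None:
        return ["sliding"] * num_layers
    # Assumption: the full-attention layer closes each group of layers.
    period = local_per_global + 1
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    print(layer_pattern(8, 3))      # 3:1 pattern
    print(layer_pattern(12, 5))     # 5:1 pattern
    print(layer_pattern(4, None))   # pure sliding window
```

In the 3:1 case one layer in four attends over the full context, in the 5:1 case one in six, and in the pure case none do.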