AI Dynamics

Global AI News Aggregator

Attention Mechanism Ratios in Language Models: Olmo vs Gemma

Do you remember if they used sliding window attention in a 3:1 ratio with full attention (like Olmo 3), in a 5:1 ratio (like Gemma 3), or just pure sliding window attention?

→ View original post on X — @rasbt
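
To make the ratios in the question concrete, here is a minimal sketch of how a per-layer attention schedule could be laid out. It reads "3:1" as three sliding-window layers for every one full-attention layer, and "5:1" as five to one. The function name `layer_pattern`, its parameters, and the choice to place the full-attention layer at the end of each group are all illustrative assumptions, not taken from the Olmo 3 or Gemma 3 code.

```python
from typing import List, Optional


def layer_pattern(num_layers: int, local_per_global: Optional[int]) -> List[str]:
    """Return the attention type ("sliding" or "full") for each layer.

    Hypothetical helper: assumes each group of `local_per_global`
    sliding-window layers is followed by one full-attention layer.
    Passing None means pure sliding window attention in every layer.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if local_per_global is None:
        # Pure sliding window attention: no full-attention layers at all.
        return ["sliding"] * num_layers
    if local_per_global < 0:
        raise ValueError("local_per_global must be non-negative or None")

    period = local_per_global + 1  # e.g. 3 sliding + 1 full = 4
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    print("3:1 ", layer_pattern(8, 3))
    print("5:1 ", layer_pattern(12, 5))
    print("pure", layer_pattern(4, None))
```

Under these assumptions, an 8-layer model at 3:1 has full attention in layers 4 and 8, while a 12-layer model at 5:1 has it in layers 6 and 12. Real models may order the layers within each group differently, so treat the placement as a sketch only.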