AI Dynamics

Global AI News Aggregator

Mixture-of-Depths Attention: Transformer Architecture Innovation

"Mixture-of-Depths Attention": This paper teaches a Transformer to attend not just across tokens, but also across depth, to the keys and values (KV) from its earlier layers. That helps recover shallow-layer signals that standard residual stacking tends to dilute, improving performance with only a small extra …
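To make the idea in the summary concrete, here is a minimal sketch of what attending over both tokens and earlier-layer KV could look like. It is an illustration under stated assumptions, not the paper's implementation: the post does not say how the depth KV is combined, so this sketch simply appends it to the current layer's sequence KV and applies one softmax over both. The function names (`attend`, `depth_augmented_attend`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def depth_augmented_attend(query, seq_keys, seq_values, depth_keys, depth_values):
    """ASSUMPTION: depth KV (from earlier layers) is appended to the
    sequence KV and normalized in the same softmax. The source post does
    not specify the actual combination rule."""
    keys = seq_keys + depth_keys
    values = seq_values + depth_values
    return attend(query, keys, values)


# Toy inputs (hypothetical): two token positions in the current layer,
# plus one key/value pair carried over from an earlier layer.
query = [1.0, 0.0]
seq_keys = [[1.0, 0.0], [0.0, 1.0]]
seq_values = [[1.0, 0.0], [0.0, 1.0]]
depth_keys = [[2.0, 0.0]]
depth_values = [[5.0, 5.0]]

output, weights = depth_augmented_attend(
    query, seq_keys, seq_values, depth_keys, depth_values
)
```

In this sketch the extra cost is one additional key/value pair per earlier layer consulted, and passing empty depth lists reduces it to ordinary attention.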

→ View original post on X: @askalphaxiv