AI Dynamics

Global AI News Aggregator

Interleaved Head Attention Improves Transformer Architecture

“Interleaved Head Attention”: A core limitation of transformers is that standard attention gives you H isolated heads, so you get only H independent attention patterns. This paper lets heads mix before attention by creating pseudo-heads from learned combinations of
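As a rough illustration of the idea in the excerpt, here is a minimal NumPy sketch. The excerpt is cut off before it says what gets combined, so everything about the mixing step is an assumption made for illustration, not the paper's method: the sketch mixes per-head queries and keys, and the names `mix_heads`, `mix_q`, `mix_k` and the pseudo-head count `P` are hypothetical. Random matrices stand in for learned weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_patterns(q, k):
    # q, k: (heads, seq, d_head) -> attention weights (heads, seq, seq).
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1)


def mix_heads(x, mix):
    # ASSUMPTION (not from the source): each pseudo-head is a linear
    # combination of the original heads.
    # x: (H, seq, d_head), mix: (P, H) -> (P, seq, d_head).
    return np.einsum("ph,hsd->psd", mix, x)


rng = np.random.default_rng(0)
H, P, seq, d_head = 4, 8, 5, 16  # P is a hypothetical pseudo-head count

q = rng.standard_normal((H, seq, d_head))
k = rng.standard_normal((H, seq, d_head))

# Standard attention: H isolated heads give H attention patterns.
standard = attention_patterns(q, k)

# Hypothetical mixing before attention; random stand-ins for learned weights.
mix_q = rng.standard_normal((P, H))
mix_k = rng.standard_normal((P, H))
mixed = attention_patterns(mix_heads(q, mix_q), mix_heads(k, mix_k))

print(standard.shape, mixed.shape)
```

With identity mixing matrices the sketch reduces to standard attention, so under these assumptions the mixing step is a strict generalization of isolated heads.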

→ View original post on X: @askalphaxiv