Interleaved Head Attention Improves Transformer Architecture

A core limitation of standard transformer attention is that its H heads are computed in isolation, so each layer can express only H independent attention patterns. This paper lifts that limit by letting heads mix before attention: pseudo-heads are formed as learned combinations of the original heads.
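To make the idea concrete, here is a minimal PyTorch sketch of one plausible reading: a learned H'×H mixing matrix combines the original heads' queries, keys, and values into pseudo-heads before the softmax. All names here (`InterleavedHeadAttention`, `mix`, `n_pseudo_heads`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterleavedHeadAttention(nn.Module):
    """Sketch: pseudo-heads as learned combinations of the original heads.

    Hypothetical implementation; the paper's exact mixing scheme may differ.
    """

    def __init__(self, d_model: int, n_heads: int, n_pseudo_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.hp = n_heads, n_pseudo_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Learned mixing matrix: each of the H' pseudo-heads is a
        # weighted combination of the H original heads (pre-attention).
        self.mix = nn.Parameter(torch.randn(n_pseudo_heads, n_heads) / n_heads ** 0.5)
        self.out = nn.Linear(n_pseudo_heads * self.d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into the H original heads: (B, H, T, d_head).
        split = lambda t: t.view(B, T, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Mix heads BEFORE attention: (B, H', T, d_head).
        mix = lambda t: torch.einsum('ph,bhtd->bptd', self.mix, t)
        q, k, v = mix(q), mix(k), mix(v)
        # Standard scaled dot-product attention, per pseudo-head.
        y = F.scaled_dot_product_attention(q, k, v)
        y = y.transpose(1, 2).reshape(B, T, self.hp * self.d_head)
        return self.out(y)
```

Mixing before the softmax is what distinguishes this from simply adding more heads: each pseudo-head's attention logits can draw on features from all H original projections, rather than from a single isolated subspace.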