AI Dynamics

Global AI News Aggregator


Multi-Head Attention: How Output Projection Enables Information Sharing

I haven’t gone back to the paper, but looking at this equation alone: When the heads are independent, W_o introduces correlations among them, allowing for information sharing and a denser representation.
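
A minimal sketch of that point, assuming the usual formulation in which the per-head outputs are concatenated and multiplied by a single matrix W_o. The toy numbers and the helper names (matvec, project_heads, project_heads_blockwise, changed_coords) are illustrative assumptions, not taken from the post or the paper. It shows two things: the projection is equivalent to summing one contribution per head, and with a dense W_o a change in one head can reach every output coordinate, whereas an identity W_o keeps each head in its own slice.

```python
def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def project_heads(heads, W_o):
    """Concatenate the per-head outputs, then apply the shared projection W_o."""
    concat = [v for head in heads for v in head]
    return matvec(concat, W_o)


def project_heads_blockwise(heads, W_o):
    """Same result written as a sum: each head multiplies its own row block of W_o."""
    d_v = len(heads[0])
    out = [0.0] * len(W_o[0])
    for i, head in enumerate(heads):
        block = W_o[i * d_v:(i + 1) * d_v]
        out = [o + c for o, c in zip(out, matvec(head, block))]
    return out


def changed_coords(heads, W_o, head_index):
    """Add 1.0 to every entry of one head and report which outputs move."""
    base = project_heads(heads, W_o)
    bumped = [[v + 1.0 for v in h] if i == head_index else h
              for i, h in enumerate(heads)]
    moved = project_heads(bumped, W_o)
    return [a != b for a, b in zip(base, moved)]


# Toy setup (assumed values): 2 heads, each with a 2-dimensional output.
heads = [[1.0, 2.0], [3.0, 4.0]]

# A dense W_o: every output column reads from both heads.
W_mixing = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# An identity W_o: plain concatenation, no sharing between heads.
W_identity = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

y = project_heads(heads, W_mixing)
y_blockwise = project_heads_blockwise(heads, W_mixing)

print(y)                                     # [4.0, 5.0, 5.0, 6.0]
print(y == y_blockwise)                      # True
print(changed_coords(heads, W_mixing, 1))    # [True, True, True, True]
print(changed_coords(heads, W_identity, 1))  # [False, False, True, True]
```

In this toy case the mixing comes entirely from the off-diagonal blocks of W_o. How much sharing a trained model actually uses depends on the learned weights, which this sketch does not address.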

→ View original post on X — @nandodf