I haven’t gone back to the paper, but looking at this equation alone: When the heads are independent, W_o introduces correlations among them, allowing for information sharing and a denser representation.
Multi-Head Attention: How Output Projection Enables Information Sharing
By
–