I suspect it all comes down to the training data and how humans write: the most important information usually appears at the beginning or the end (think of a paper's Abstract and Conclusion sections), and LLMs then parameterize their attention weights accordingly during training. 5/5
LLM Attention Weights: Training Data Structure and Human Writing Patterns