Some really interesting comments below, but the question is still open and requires investigation. I hope a few students pick it up. I liked the discussions on context length generalisation, the fact that we typically train these models as bandits (even when we do RL, which is
Context Length Generalisation and Bandit Training in Language Models