This is quite interesting. 1) I would expect the opposite to be true for, e.g., RNN-based LLMs like RWKV: since they process information sequentially, they might instead forget early information. 3/5
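The intuition here (a purely recurrent state gradually overwriting early tokens) can be sketched with a toy decaying-state update. This is an illustrative assumption, not RWKV's actual time-mixing rule; the decay factor and update form are made up for the sketch:

```python
# Toy recurrence: s_t = decay * s_{t-1} + x_t
# After T steps, the first token's weight in the state is decay**(T-1),
# so its influence shrinks geometrically as the sequence grows.

def contribution_of_first_token(seq_len: int, decay: float = 0.9) -> float:
    """Weight of token 1 in the state after seq_len steps of the toy update."""
    return decay ** (seq_len - 1)

for n in (5, 50, 500):
    print(n, contribution_of_first_token(n))
```

With a fixed decay the first token's weight after 500 steps is vanishingly small, which is the sense in which a sequential state can "forget" early information (real RWKV uses learned, channel-wise decays, so the picture is more nuanced).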