This is quite interesting … 1) I would expect that the opposite is true for, e.g., RNN-based LLMs like RWKV (since it's processing information sequentially, it might rather forget early information) 3/5
By
–
This is quite interesting … 1) I would expect that the opposite is true for, e.g., RNN-based LLMs like RWKV (since it's processing information sequentially, it might rather forget early information) 3/5