You mean alternatives to self-attention specifically (as opposed to parameter-efficient finetuning etc.)? I think none of them has stood the test of time; the problem is that they are all approximations of full attention. The relatively recent FlashAttention is hugely popular, though, and notably it is not an approximation: it computes exact attention and gets its speedup from an IO-aware, tiled implementation.
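To make the distinction concrete, here is a minimal NumPy sketch of the tiling / online-softmax idea that FlashAttention is built on. This is only an illustration of the math, not the real kernel (which fuses these steps in on-chip SRAM on the GPU); the function names and block size are made up for the example. The point is that the blocked version gives the exact same result as naive attention while never materializing the full score matrix.

```python
import numpy as np

def attention(q, k, v):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    s = (q @ k.T) / np.sqrt(d)                 # full n_q x n_k score matrix
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=2):
    """Same result, but K/V are processed in blocks with an online softmax,
    so the full score matrix is never materialized (the core idea behind
    FlashAttention's exact, IO-aware computation)."""
    d = q.shape[-1]
    n_q = q.shape[0]
    m = np.full(n_q, -np.inf)                  # running row-wise max of scores
    l = np.zeros(n_q)                          # running softmax normalizer
    acc = np.zeros((n_q, v.shape[-1]))         # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) / np.sqrt(d)            # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        acc = acc * scale[:, None] + p @ vb
        l = l * scale + p.sum(axis=-1)
        m = m_new
    return acc / l[:, None]
```

Because each block's contribution is rescaled as the running max is updated, the final output is numerically identical to the naive version — no approximation is involved, which is exactly why it avoided the fate of the approximate-attention variants.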