Like I said, I already knew that positional embeddings are trained rather than taken straight from the 2018 paper, and I had heard of axial transformers and various other attempts to defeat quadratic attention. (I think I can predict your next gatekeep, but go ahead and do it.)
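For anyone following along, here is a minimal sketch of the distinction being alluded to: the fixed sinusoidal encodings from "Attention Is All You Need" versus a learned positional embedding table (the approach used in models like BERT and GPT, where positions get trainable vectors updated by gradient descent). This assumes a PyTorch-style setup; the function names and dimensions are illustrative, not from any particular codebase.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embeddings(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings; computed once, never trained."""
    pos = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims get sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims get cosine
    return pe

max_len, d_model, seq_len = 512, 768, 16

# Learned positional embeddings: an ordinary embedding table indexed by
# position, updated during training like any other weight matrix.
learned_pe = nn.Embedding(max_len, d_model)

tokens = torch.randn(1, seq_len, d_model)        # (batch, seq, dim)
positions = torch.arange(seq_len).unsqueeze(0)   # (1, seq)

x_learned = tokens + learned_pe(positions)                          # trainable offsets
x_fixed = tokens + sinusoidal_embeddings(max_len, d_model)[:seq_len]  # fixed offsets
```

Either variant is simply added to the token embeddings before the first attention layer; the difference is only whether the positional vectors receive gradients.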