Transformers' Interpolative Architecture and Limitations for Symbolic Tasks

Ironically, Transformers are even worse in that regard, mostly due to their strongly interpolative architectural prior: multi-head attention literally hardcodes sample interpolation in latent space, since each attention output is a softmax-weighted average of value vectors. There is also the fact that recurrence is a really helpful prior for symbolic programs, and Transformers lack it.
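To make the interpolation point concrete, here is a minimal NumPy sketch (single head, no masking or projection matrices; shapes and names are illustrative, not taken from any particular library). The softmax weights are non-negative and sum to 1, so every attention output is a convex combination of the value vectors, i.e., it can only interpolate within their convex hull:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each output row is a convex combination of rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: rows >= 0, sum to 1
    return weights @ V, weights                          # outputs interpolate rows of V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = scaled_dot_product_attention(Q, K, V)

# Convexity check: attention cannot produce a point outside the convex hull of V.
assert np.all(w >= 0) and np.allclose(w.sum(axis=-1), 1.0)
```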
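As a toy contrast for the recurrence point, consider a symbolic task like parity: a recurrent program applies one state update per input symbol, so its computation depth grows with input length, whereas a fixed-depth Transformer applies the same constant number of layers to every input. A sketch in plain Python (illustrative, not a claim about any specific model):

```python
def parity(bits):
    """Recurrent scan: one state bit, updated once per symbol, for any input length."""
    state = 0
    for b in bits:    # recurrence: depth scales with sequence length
        state ^= b    # simple symbolic update rule applied step by step
    return state

assert parity([1, 0, 1, 1]) == 1
```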