“The Recurrent Transformer: Greater Effective Depth and Efficient Decoding” Transformers are great at parallel processing, but they’re shallow through time, because each layer lets tokens interact only once. This paper changes that by storing keys/values from each layer’s output,
