6/ Attention as an RNN – presents a new attention mechanism that can be trained in parallel (like Transformers) and updated efficiently as new tokens arrive, using only constant memory at inference time (like RNNs).
Attention as RNN: Parallel Training with Constant Memory Inference
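
To make the "constant memory" claim concrete, here is a minimal sketch of how softmax attention for a single query can be computed one token at a time. It is not the paper's code: it assumes one fixed query vector, unscaled dot-product scores, and plain Python lists, and the function names `attention_recurrent` and `attention_batch` are made up for illustration. The recurrent version keeps only a running maximum, a running denominator, and a running numerator, so its state does not grow with the number of tokens seen.

```python
import math


def attention_recurrent(query, keys, values):
    """Softmax attention for one query, consuming (key, value) pairs one at a time.

    The state is (m, denom, numer): its size depends on the value dimension only,
    not on how many tokens have been processed.
    """
    m = -math.inf                    # running max of the scores (numerical stability)
    denom = 0.0                      # running sum of exp(score - m)
    numer = [0.0] * len(values[0])   # running sum of exp(score - m) * value

    for k, v in zip(keys, values):
        s = sum(q_i * k_i for q_i, k_i in zip(query, k))  # dot-product score
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescales the old state; 0.0 on the first step
        w = math.exp(s - m_new)      # weight of the new token
        denom = denom * scale + w
        numer = [n * scale + w * v_j for n, v_j in zip(numer, v)]
        m = m_new

    return [n / denom for n in numer]


def attention_batch(query, keys, values):
    """Reference: the usual all-at-once softmax attention for one query."""
    scores = [sum(q_i * k_i for q_i, k_i in zip(query, k)) for k in keys]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


if __name__ == "__main__":
    q = [0.5, -1.0]
    ks = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(attention_recurrent(q, ks, vs))
    print(attention_batch(q, ks, vs))
```

Both functions should agree up to floating-point rounding. The batch form is the one that parallelizes over tokens, while the recurrent form is the one that absorbs a new token with a fixed-size update.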
