An unsung hero of the Transformer's success is not the attention mechanism per se, but *causal masking*: during training, it lets us simplify two for loops to just one, because each layer can process every position of the sequence at once.

> RNN

    for t in [1..S]:
        for i in [1..L]:
            x[t], h[i] = layers[i](x[t], h[i])

> transformer

    for i in [1..L]:
        x = layers[i](x)
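The claim above can be checked with a minimal sketch, assuming a toy single-head attention layer with no learned weights (the input vectors serve as queries, keys, and values). The names `causal_attention`, `run_layers`, and `run_stepwise` are hypothetical helpers for illustration, not taken from any library. It shows that running each layer once over the whole masked sequence gives the same outputs as recomputing every time step separately on its prefix.

```python
import math

def causal_attention(x):
    """Toy single-head self-attention with a causal mask.

    x is a list of S vectors (lists of floats). Position t attends only to
    positions 0..t. Assumption: no learned projections, so x acts as queries,
    keys, and values. In a real framework the loop over t below would be one
    masked matrix multiply rather than a Python loop.
    """
    out = []
    for t, q in enumerate(x):
        # Causal mask: only positions <= t are visible to query t.
        visible = x[: t + 1]
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visible]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, visible))
                    for d in range(len(q))])
    return out

def run_layers(x, num_layers):
    """Transformer-style order: one pass per layer over the whole sequence."""
    for _ in range(num_layers):
        x = causal_attention(x)
    return x

def run_stepwise(x, num_layers):
    """RNN-style order: for each time step t, run all layers on the prefix x[0..t]."""
    return [run_layers(x[: t + 1], num_layers)[-1] for t in range(len(x))]

if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.5, -1.0], [2.0, 0.3]]
    full = run_layers(seq, num_layers=3)
    step = run_stepwise(seq, num_layers=3)
    same = all(math.isclose(a, b) for u, v in zip(full, step) for a, b in zip(u, v))
    print("outputs match:", same)
```

Because the mask prevents position t from seeing anything after it, appending later tokens never changes earlier outputs, which is why the outer loop over time steps can be dropped.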
Causal Masking: The Unsung Hero Behind Transformer Success