This is a must-watch to understand how attention works! Great visualization, explaining:
– Why are there K and V matrices, and what do they represent?
– Why mask the lower left part of the key-query (QK) product?
– Why apply -inf to the lower left part of the key-query (QK) product before softmax rather than just
Understanding Attention: K and V matrices, masking, and -inf for softmax
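The masking question above can be made concrete with a small sketch. It is an illustration, not taken from the video: the score values and the helper names `softmax` and `causal_attention_weights` are hypothetical, and the comparison against filling with 0 is our own assumption about one naive alternative. Note that this sketch puts queries on rows and keys on columns, so the masked "future" entries sit in the upper right; in the transposed layout (keys on rows, queries on columns) the same entries appear in the lower left.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores, fill):
    """Replace scores where key index > query index with `fill`, then softmax each row."""
    masked = [
        [s if k <= q else fill for k, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]


# Hypothetical 3x3 raw attention scores: rows = queries, columns = keys.
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 1.5, 2.5],
    [1.0, 0.2, 0.7],
]

# Filling with -inf: exp(-inf) is 0, so future positions get exactly zero weight
# and each row still sums to 1 over the allowed positions.
with_neg_inf = causal_attention_weights(scores, -math.inf)

# Filling with 0 (assumed naive alternative): exp(0) is 1, so future positions
# still receive nonzero weight after softmax.
with_zero = causal_attention_weights(scores, 0.0)

for name, weights in (("-inf fill", with_neg_inf), ("zero fill", with_zero)):
    print(name)
    for row in weights:
        print("  ", [round(w, 3) for w in row])
```

Running it shows the difference directly: with the -inf fill the masked entries come out as exactly zero, while with the zero fill they keep some weight, so information from later positions would still leak into earlier ones.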