Because we sum the rows, the gradient for each token is backpropagated to every row that was used to construct it. With enough training, the model therefore settles on a good compromise between the words that share rows. Frequent words are naturally prioritised in this process, simply because they receive more gradient updates. A small sketch of this gradient flow follows below.
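To make the gradient flow concrete, here is a minimal sketch in PyTorch (an assumption; the post does not specify a framework, and the table size and row indices below are made up for illustration). It shows that when a token's vector is the sum of several rows, each of those rows receives the same gradient, while unused rows get none:

```python
import torch

# Hypothetical embedding table: 10 rows, 4 dimensions.
num_rows, width = 10, 4
table = torch.randn(num_rows, width, requires_grad=True)

# Suppose a token maps to rows 2, 5 and 7 (made-up indices).
rows = [2, 5, 7]
vector = table[rows].sum(dim=0)  # the token's vector is a sum of rows

loss = vector.pow(2).sum()       # stand-in for any downstream loss
loss.backward()

# Summation distributes the same gradient to every participating row;
# rows that were not used receive no update.
print(table.grad[rows])  # three identical nonzero gradient rows
print(table.grad[0])     # all zeros
```

A row shared by a frequent word and a rare word will be pulled by both, but the frequent word contributes this update far more often, which is why the compromise tilts in its favour.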