Training loss is evaluated over the batch, i.e. 0.5M tokens. It's noisy but this is expected, you could be iterating through easy or hard documents in the training data. The validation loss is averaged over 20 batches of 0.5M tokens (this is a hyperparameter), so it is smoother.
Training Loss Noise and Validation Loss Smoothing Explained
By
–
Leave a Reply