Training loss is evaluated over a single batch, i.e. 0.5M tokens. It's noisy, but this is expected: at any given step you could be iterating through easy or hard documents in the training data. The validation loss is averaged over 20 batches of 0.5M tokens each (the batch count is a hyperparameter), so it is smoother.
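
To get a rough feel for why averaging helps, here is a toy simulation. It is only a sketch: it assumes each batch loss is a fixed "true" loss plus independent Gaussian noise, and the values `true_loss = 3.0` and `noise_std = 0.1` are made up for illustration. Real batches are not perfectly independent, so the actual smoothing can differ.

```python
import random
import statistics


def simulate(num_steps=200, eval_batches=20, noise_std=0.1, seed=0):
    """Compare the spread of single-batch losses vs. losses averaged over several batches."""
    rng = random.Random(seed)
    true_loss = 3.0  # hypothetical underlying loss, held constant for the toy example

    # "Training" loss: one noisy batch per logged step.
    train = [true_loss + rng.gauss(0, noise_std) for _ in range(num_steps)]

    # "Validation" loss: mean over eval_batches noisy batches per evaluation.
    val = [
        statistics.fmean(
            true_loss + rng.gauss(0, noise_std) for _ in range(eval_batches)
        )
        for _ in range(num_steps)
    ]
    return statistics.stdev(train), statistics.stdev(val)


train_std, val_std = simulate()
print(f"std of single-batch loss:  {train_std:.4f}")
print(f"std of 20-batch mean loss: {val_std:.4f}")
print(f"ratio: {train_std / val_std:.2f}")
```

Under the independence assumption, the noise of a mean over N batches shrinks by about a factor of sqrt(N), so with 20 batches you would expect the validation curve to be roughly 4 to 5 times less noisy than the training curve.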