Over the last few iters you may be seeing early signs of instability. I saw the same at around 250B tokens: it slowly gets worse and worse, and then the loss spikes. I haven't stabilized it yet. Right now I'm seeing how easily it goes away with simple solutions, like resetting the data loader, etc.
Training instability detected around 250B tokens during model training
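
For anyone who wants to catch this kind of spike automatically rather than eyeballing the curve, here is a minimal sketch in plain Python. It is not what was used in this run; the class name `LossSpikeMonitor`, the window size, and the threshold are all made up for illustration. It only flags the sudden spike by comparing each new loss against a rolling window, and does not try to detect the slow degradation that comes before it.

```python
from collections import deque
from statistics import mean, stdev


class LossSpikeMonitor:
    """Hypothetical helper: flags a loss far above the recent rolling window."""

    def __init__(self, window=100, threshold=4.0):
        self.window = window          # number of recent losses used as baseline
        self.threshold = threshold    # z-score above which a loss counts as a spike
        self.history = deque(maxlen=window)

    def update(self, loss):
        """Record `loss` and return True if it looks like a spike."""
        is_spike = False
        if len(self.history) == self.window:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # Guard against a zero-variance window before dividing.
            is_spike = sigma > 0 and (loss - mu) / sigma > self.threshold
        if not is_spike:
            # Keep spikes out of the baseline so they do not inflate it.
            self.history.append(loss)
        return is_spike
```

In a training loop you would call `monitor.update(...)` with the scalar loss each iteration and, when it returns `True`, try whatever simple fix you are testing (for example, re-creating the data loader). The default values here are guesses and would need tuning for a real run.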