Training and Validation Loss Discrepancy in Single-Epoch LLMs

If it's just one epoch, why would you expect a meaningful difference between training loss and validation loss? After all, during a single pass every batch is scored before the model updates on it, so training loss is effectively measured on unseen data too. (Unless there's some implicit practice for LLMs that I haven't heard of, where the validation set isn't just a random holdout from the training data.)
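For concreteness, here is a minimal sketch of the setup the parenthetical refers to: a validation set built as a plain random holdout from the same document pool as the training data. The pool and split sizes are made up for illustration.

```python
import random

# Toy sketch of the split the question assumes; all names and sizes are
# illustrative. For LLM pretraining, the validation set is typically just
# a random holdout drawn from the same document pool as the training data.
documents = [f"doc_{i}" for i in range(10_000)]

random.seed(0)
random.shuffle(documents)
val_docs = documents[:500]     # i.i.d. holdout, same distribution as training
train_docs = documents[500:]

# Over a single epoch, every training batch is scored before the optimizer
# updates on it, so each logged training loss is computed on examples the
# model has never trained on, which is statistically the same situation as
# scoring the validation holdout.
print(len(train_docs), len(val_docs))  # 9500 500
```

Under this assumption, both curves are averages over data the model has not yet trained on, so over one epoch they should track each other closely.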