All our models were trained on at least 1T tokens, much more than what is typically used at this scale.
Interestingly, even after 1T tokens, the 7B model was still improving.
3/n
[Figure: models trained on 1T tokens, with continued improvement at 7B scale]
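To put "1T tokens" in perspective, here is a rough back-of-the-envelope check. It assumes the widely cited ~20-tokens-per-parameter compute-optimal heuristic from the Chinchilla paper (Hoffmann et al., 2022), which the thread itself does not mention, so treat the numbers as a sketch rather than the authors' own comparison:

```python
# Back-of-the-envelope comparison, assuming the ~20-tokens-per-parameter
# "Chinchilla" compute-optimal heuristic (Hoffmann et al., 2022).
params = 7e9                       # 7B-parameter model
chinchilla_tokens = 20 * params    # ~140B tokens: a "typical" budget at this scale
actual_tokens = 1e12               # at least 1T tokens, per the thread

print(f"Compute-optimal budget: {chinchilla_tokens:.0e} tokens")
print(f"Actual budget:          {actual_tokens:.0e} tokens "
      f"(~{actual_tokens / chinchilla_tokens:.0f}x larger)")
```

Under that assumption, 1T tokens is roughly 7x the compute-optimal budget for a 7B model, which is consistent with the claim that the model was still improving at the end of training.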