In llm.c pretraining we were already mildly perplexed why seem to be outperforming GPT-2 & 3 (124M) training on just 10B tokens instead of something closer to 100-300B, per the original papers. I suspect a good chunk of it may be just the dataset quality, so I'm eager to retrain
llm.c Outperforming GPT-2/3 with Fewer Training Tokens
By
–
Leave a Reply