GPT-3 Training Surpasses Expected Performance on FineWeb Dataset

The example here is the llm.c GPT-3 (124M) training run on FineWeb (figure cropped at 250B tokens). We seem to surpass the GPT-3 HellaSwag score (green line) at ~150B tokens, whereas per the paper this was expected at around 300B tokens. I will re-run with FineWeb-Edu, and I do want to be a bit careful about drawing conclusions.
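To make the comparison concrete, here is a minimal sketch (not part of llm.c) of how one might locate the crossover point where the run's HellaSwag accuracy first exceeds the GPT-3 baseline shown by the green line. The baseline value and the evaluation checkpoints below are illustrative assumptions, not numbers taken from the actual run logs.

```python
# Sketch: find the first token count at which HellaSwag accuracy crosses a baseline.
# BASELINE_ACC is an assumed placeholder for the GPT-3 (small) HellaSwag score;
# substitute the value reported in the paper.
BASELINE_ACC = 0.337

def first_crossover(eval_points, baseline=BASELINE_ACC):
    """eval_points: list of (tokens_seen, hellaswag_accuracy) in training order.
    Returns the first token count at which accuracy >= baseline, or None."""
    for tokens, acc in eval_points:
        if acc >= baseline:
            return tokens
    return None

if __name__ == "__main__":
    # Hypothetical evaluation checkpoints (tokens, accuracy), for illustration only.
    evals = [(50e9, 0.305), (100e9, 0.328), (150e9, 0.341), (200e9, 0.352)]
    tokens = first_crossover(evals)
    if tokens is not None:
        print(f"Surpassed baseline at ~{tokens / 1e9:.0f}B tokens")
    else:
        print("Baseline not yet surpassed")
```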