Great question. Yes, I was surprised that 10B tokens seemed enough. I believe GPT-2 was trained on somewhere around ~100B tokens. The reason we reach this performance in only 10B tokens, I think, may be the following: 1. FineWeb could just be higher quality than WebText on a per-token basis. This was
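(Not part of the original reply, just a rough sketch.) If you want to sanity-check the ~10B figure yourself, something like the snippet below should work, assuming the public `HuggingFaceFW/fineweb` release with its `sample-10BT` subset and the GPT-2 BPE tokenizer; the exact config names may differ from what the post actually used.

```python
# Sketch: stream the FineWeb 10BT sample and count GPT-2 tokens,
# to sanity-check the ~10B-token figure. Assumes the HuggingFaceFW/fineweb
# dataset with a "sample-10BT" config and a "text" column.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")  # same BPE vocabulary GPT-2 used

# Streaming avoids downloading the whole sample up front.
ds = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

total_tokens = 0
for i, doc in enumerate(ds):
    # encode_ordinary skips special-token handling; fine for raw web text
    total_tokens += len(enc.encode_ordinary(doc["text"]))
    if i % 100_000 == 0:
        print(f"{i} docs, {total_tokens:,} tokens so far")

print(f"total: {total_tokens:,} tokens")  # should land in the ~10B range
```

The point of the comparison is just the rough ratio: ~10B tokens of FineWeb versus the ~100B-token budget GPT-2 reportedly saw, i.e. roughly 10x fewer tokens for comparable evaluation numbers, which is what makes the per-token-quality explanation plausible.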