yeah, once this paper was released i was like “ohhh, so FineWeb isn’t simply CommonCrawl plus plus” and it all clicked into place. @eugeneyan pointed me to this Apple paper we’ve talked about on the pod
Understanding FineWeb: Beyond CommonCrawl Dataset Architecture