The FineWeb team is happy to finally release FineWeb2. FineWeb2 extends the data-driven approach to pre-training dataset design introduced in FineWeb 1 to cover 1893 languages. In our experiments, it outperforms all other publicly available multilingual pre-training datasets.
FineWeb2 Releases Multilingual Dataset for AI Pretraining