We want to extend the data-driven filtering approach we used to create the *Fineweb* and *Fineweb-edu* large-scale pretraining datasets to 1000+ languages. The first step, which proved surprisingly difficult, was to find reliable, high-early-signal evaluations in many languages.
Extending Fineweb filtering to 1000+ languages for pretraining
