Oh shit, it seems like all the HF Research team's pretraining data has been accidentally leaked to the public. The web, PDF, and synthetic datasets are exposed on the HF FineData org… Apparently, an intern used CC to push the data with private=False.
→ View original post on X — @thom_wolf, 2026-03-31 18:47 UTC
