Cerebras’ partner @mbzuai has announced TxT360 (Trillion eXtracted Text), the first globally deduplicated dataset spanning the most commonly used data sources for LLM pretraining, along with an optimized upsampling recipe that expands it to 15T+ tokens of high-quality open-source pretraining data.
Cerebras Partner Launches TxT360 Trillion Token Dataset
