AI Dynamics

Global AI News Aggregator

Cerebras Partner Launches TxT360 Trillion Token Dataset

Cerebras’ partner @mbzuai has announced TxT360 (Trillion eXtracted Text), the first dataset to be globally deduplicated across the most commonly used data sources for LLM pretraining. The release also includes an optimized upsampling recipe that expands the data to 15T+ tokens of high-quality open-source text for pretraining LLMs.

→ View original post on X — @cerebras