LLM360 releases TxT360: 15T token pre-training dataset

An impressive release from LLM360: TxT360, a new pre-training dataset of 15T tokens. Compared to previous open-source pre-training datasets, it includes many new sources, such as FreeLaw and PG-19 (books), among others. It's really interesting
