A 3-trillion-token (yes, trillion) open-source training dataset for LLMs: Dolma is built from a mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. It is now available on Hugging Face, if you have a spare 5.4 terabytes.
Dolma: A 3 Trillion Token Open-Source Training Dataset for LLMs
