AI Dynamics

Global AI News Aggregator

Dolma: A 3 Trillion Token Open-Source Training Dataset for LLMs

Dolma is an open-source training dataset for LLMs containing 3 trillion (yes, trillion) tokens. It is built from a mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. It is now available on Hugging Face, if you have a spare 5.4 terabytes.

→ View original post on X — @aibreakfast