Awesome and highly useful: FineWeb-Edu, a high-quality LLM dataset that filters the original 15 trillion FineWeb tokens down to 1.3 trillion of the highest (educational) quality, as judged by Llama 3 70B. It also comes with a highly detailed paper. Turns out that LLMs learn a lot better and faster from this kind of filtered, educational data.
FineWeb-Edu: High-Quality LLM Dataset Filtering for Better Learning