The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
paper page: https://huggingface.co/papers/2306.01116
… Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media
RefinedWeb Dataset: High-Quality Web Data for Falcon LLM Training
