AI Dynamics

Global AI News Aggregator

RedPajama Paper: 46 Quality Filters for Pre-training Data

The new RedPajama paper is already a classic if you're into pre-training data. It implements 46 quality filters with document-level and line-level granularity:
• Natural language: fraction of all-caps words, terminal punctuation, etc.
• Repetitiveness: n-gram statistics

→ View original post on X: @maximelabonne
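
To make the two filter families in the post a bit more concrete, here is a minimal Python sketch of what such signals could look like. This is not the RedPajama implementation: the function names, the set of terminal punctuation marks, and the n-gram size are all assumptions chosen for illustration, and the paper's actual definitions may differ.

```python
from collections import Counter


def frac_all_caps_words(text: str) -> float:
    """Document-level signal: share of whitespace-separated words in all caps."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.isupper() for w in words) / len(words)


def frac_lines_with_terminal_punct(text: str, marks=(".", "!", "?", '"')) -> float:
    """Line-level signal aggregated per document: share of non-empty lines
    that end in a terminal punctuation mark (the mark set is an assumption)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.endswith(marks) for ln in lines) / len(lines)


def frac_duplicate_ngrams(text: str, n: int = 3) -> float:
    """Repetitiveness signal: share of word n-gram occurrences whose n-gram
    appears more than once in the document (n=3 is an arbitrary choice)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)


if __name__ == "__main__":
    doc = "BUY NOW best deal\nclick here\nThis line ends properly."
    print(frac_all_caps_words(doc))
    print(frac_lines_with_terminal_punct(doc))
    print(frac_duplicate_ngrams("the cat sat the cat sat"))
```

In practice, signals like these would be computed once per document and stored, so that thresholds can be tuned later without reprocessing the raw text.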
