The new RedPajama paper is already a classic if you're into pre-training data. It implements 46 quality filters with document-level and line-level granularity:
• Natural language: fraction of all-caps words, terminal punctuation, etc.
• Repetitiveness: n-gram statistics
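To make the filter categories above concrete, here is a minimal Python sketch of three such signals. It is not the RedPajama implementation: the function names, the whitespace tokenization, the set of terminal punctuation marks, and the choice of n = 3 are all assumptions made for illustration.

```python
from collections import Counter

# Assumed set of "terminal punctuation" marks; the paper's exact set may differ.
TERMINAL_MARKS = (".", "!", "?", '"')


def line_ends_with_terminal_punct(line: str) -> bool:
    """Line-level signal: does the stripped line end in a terminal mark?"""
    return line.strip().endswith(TERMINAL_MARKS)


def terminal_punct_line_fraction(text: str) -> float:
    """Document-level signal: share of non-empty lines ending in a terminal mark."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(line_ends_with_terminal_punct(ln) for ln in lines) / len(lines)


def all_caps_word_fraction(text: str) -> float:
    """Document-level signal: share of whitespace-separated words in all caps.

    Assumption: single-letter words such as "I" or "A" are not counted as all-caps.
    """
    words = text.split()
    if not words:
        return 0.0
    caps = sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper())
    return caps / len(words)


def duplicate_ngram_fraction(text: str, n: int = 3) -> float:
    """Repetitiveness signal: share of word n-grams that occur more than once."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

In a real pipeline each signal would be stored per document and thresholded afterwards, so the same statistics can be reused with different cutoffs. The thresholds themselves are a separate design choice and are not shown here.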
RedPajama Paper: 46 Quality Filters for Pre-training Data