Well, the tokenizer I used was trained on large quantities of data, and I filtered the tokens based on yet more data from FineWeb. The question is whether that's acceptable according to your rules…
Tokenizer Training and Data Filtering Compliance Standards