Sunday small guessing puzzle

Let's say I have 3 tokenizers:
– llama2: 32k vocab
– falcon: 65k vocab
– GPT4: 100k vocab

I take ~2M random documents from the web (let’s say 10 random parquet files from RefinedWeb from https://huggingface.co/datasets/tiiuae/falcon-refinedweb … roughly 1B tokens). I tokenize them
Tokenizer vocabulary size comparison across language models
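
If you want to try the setup yourself, here is a minimal sketch of the counting step. It assumes each tokenizer can be wrapped as a plain function from text to a list of tokens; `count_tokens`, `toy_encoders`, and `toy_docs` are hypothetical names made up for illustration, and the two stand-in encoders are NOT the tokenizers from the puzzle, they only let the snippet run offline.

```python
from typing import Callable, Dict, Iterable, List


def count_tokens(
    docs: Iterable[str],
    encoders: Dict[str, Callable[[str], List]],
) -> Dict[str, int]:
    """Sum the number of tokens each encoder produces over all documents."""
    totals = {name: 0 for name in encoders}
    for doc in docs:
        for name, encode in encoders.items():
            totals[name] += len(encode(doc))
    return totals


# Stand-in encoders (assumption: illustration only, not real subword tokenizers).
# To run the actual comparison, one could swap in e.g.
# tiktoken.get_encoding("cl100k_base").encode for GPT4, and Hugging Face
# tokenizers for llama2 and falcon.
toy_encoders = {
    "whitespace": str.split,  # one token per whitespace-separated word
    "characters": list,       # one token per character
}

toy_docs = ["the quick brown fox", "tokenizers split text"]

totals = count_tokens(toy_docs, toy_encoders)
print(totals)
```

Feeding the same documents to every encoder in one pass keeps the comparison fair: the only thing that varies between the totals is the tokenizer.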