And our answers are out! Running on 1B tokens from the web (filtered and mostly in English, as detailed in https://huggingface.co/papers/2306.01116…), we got:
– GPT4 tokenizer (100k vocab) gives you 0.997B tokens
– Falcon tokenizer (64k vocab) gives you ~5% more tokens (1.04B)
– Llama2 tokenizer
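
For anyone who wants to run this kind of comparison themselves, here is a minimal sketch of the bookkeeping, not the setup used for the numbers above. The function name `compare_token_counts` and the two toy tokenizers are hypothetical stand-ins chosen so the snippet runs with no downloads. In practice you would presumably pass each real tokenizer's encode function as the callable.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple


def compare_token_counts(
    texts: Iterable[str],
    tokenizers: Dict[str, Callable[[str], Sequence]],
    baseline: str,
) -> Dict[str, Tuple[int, float]]:
    """Return {name: (total_tokens, relative_difference_vs_baseline)}.

    `tokenizers` maps a label to any callable that turns a string into a
    sequence of tokens (hypothetical interface, for illustration only).
    """
    corpus = list(texts)  # materialise so every tokenizer sees the same texts
    counts = {
        name: sum(len(tokenize(text)) for text in corpus)
        for name, tokenize in tokenizers.items()
    }
    base = counts[baseline]
    if base == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    # A relative difference of 0.05 means 5% more tokens than the baseline.
    return {name: (n, n / base - 1.0) for name, n in counts.items()}


# Toy stand-ins (assumptions): whitespace splitting and per-character splitting.
toy_tokenizers = {
    "whitespace": str.split,
    "character": list,
}
sample = ["tokenizers split text differently", "so counts vary"]

results = compare_token_counts(sample, toy_tokenizers, baseline="whitespace")
for name, (count, rel) in results.items():
    print(f"{name}: {count} tokens ({rel:+.1%} vs baseline)")
```

Reporting the difference relative to one baseline tokenizer keeps the comparison independent of corpus size, so a "~5% more tokens" style statement can be read straight off the result.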