If you measure downstream performance on HellaSwag rather than speedrun-equivalent loss, then different tokenizer approaches come out on top… The first run I did was much better on common-sense downstream tasks, and it trained in equivalent time or better.
Tokenizer Approaches Impact LLM Performance on HellaSwag Benchmarks