AI Dynamics

Global AI News Aggregator

Tokenizer Training Data: Understanding Token Presence Significance

If uberinternal WAS a token, that would tell you it was one of the top ~30,000 character sequences present in the text they used to build the tokenizer, which is a different corpus from the training set used to train the model. uberinternal not being a token doesn't tell you much.
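To make the asymmetry concrete, here is a minimal sketch of a toy byte-pair-encoding (BPE) style trainer. It is an illustration only, not the tokenizer any particular model uses: the function name train_bpe, the tiny corpora, and the merge budget are all assumptions made up for this example. It shows that a string ends up in the vocabulary when it is frequent in the tokenizer-building text, and that its absence from the vocabulary says nothing about a separate model-training corpus.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE trainer (hypothetical helper): returns the set of learned tokens."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        # Merge the most frequent pair; ties are broken deterministically.
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.add(merged)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab


# Tokenizer-building text where the string is frequent: it becomes a token.
vocab_a = train_bpe(["uberinternal"] * 50 + ["the", "cat", "sat"], num_merges=20)

# Tokenizer-building text that never mentions the string: no such token.
vocab_b = train_bpe(["the", "cat", "sat"] * 50, num_merges=20)

# A separate (made-up) model-training corpus can still be full of the string,
# and vocab_b gives no hint of that either way.
model_training_corpus = ["uberinternal"] * 1000

print("uberinternal" in vocab_a)  # True
print("uberinternal" in vocab_b)  # False
```

Production tokenizers work on bytes and far larger corpora, but the logic is the same: vocabulary membership is evidence about the tokenizer-building text only.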

→ View original post on X (@simonw)