Nice new read on tokenization!
You've probably heard of the SolidGoldMagikarp token, which breaks GPT-2-family models: the string appeared in the tokenizer's training corpus, so it got its own token, but it was largely absent from the LLM's later training data, leaving that token's embedding effectively untrained. This paper digs in with far more depth and detail, across many more models, discovering a less
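The mechanism can be illustrated with a toy sketch (all data here is hypothetical, not from the paper): build a vocabulary from one corpus, then check which vocabulary entries never occur in the separate corpus the model is "trained" on. Any such token would carry an embedding the model never learned.

```python
from collections import Counter

# Toy corpora (hypothetical): the glitch string is present when the
# tokenizer vocabulary is built, but filtered out of the model's data.
tokenizer_corpus = ["hello world", "SolidGoldMagikarp hello", "world world"]
model_corpus = ["hello world", "world hello"]

# Build a word-level vocab from the tokenizer's corpus.
vocab = {}
for text in tokenizer_corpus:
    for word in text.split():
        vocab.setdefault(word, len(vocab))

# Count how often each vocab entry appears in the model's training text.
counts = Counter(w for text in model_corpus for w in text.split())

# Tokens in the vocab that the model never saw during training.
untrained = [w for w in vocab if counts[w] == 0]
print(untrained)  # -> ['SolidGoldMagikarp']
```

In a real BPE tokenizer the same mismatch arises at the subword level, but the effect is identical: the token exists, the model has no idea what it means.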
Deep dive into tokenization vulnerabilities across multiple language models