The tokenizer is an architectural prior disguised as preprocessing, and almost everyone has been treating it like plumbing. A new paper by Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, and Tiago Pimentel changes the frame: "Tokenisation via Convex Relaxations."
