It doesn’t see words as embedded vectors or as letters, it sees them as token sequences — chunks of about 4 letters from a fixed mapping of strings to tokens. It does better on letter-based tasks if you ask it to first rewrite words l-i-k-e t-h-i-s.
LLM Tokenization: Not Vectors or Letters, but Token Sequences
By
–