That’s right: the token vocabulary is built with BPE, so a long string only gets its own token when it occurs often in the pretraining corpus.
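
To make that concrete, here is a rough toy sketch of BPE vocabulary training in Python. It is not the code any particular tokenizer uses: the function name `train_bpe`, the word-level corpus, and the merge budget are all assumptions made up for illustration. It only shows the general idea that the most frequent adjacent pair is merged at each step, so a frequent string ends up as a single token while a rare one stays split into pieces.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Toy BPE trainer (hypothetical helper): learn up to num_merges merges."""
    # Represent each word as a tuple of single-character symbols, with a count.
    words = Counter(tuple(w) for w in corpus)
    # The starting vocabulary is just the individual characters.
    vocab = {ch for w in corpus for ch in w}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into one new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite every word with the chosen pair replaced by the merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, vocab


# "token" appears 50 times, "tokamak" only once, and the merge budget is small.
merges, vocab = train_bpe(["token"] * 50 + ["tokamak"], num_merges=4)
print("token" in vocab, "tokamak" in vocab)  # prints: True False
```

With only four merges available, all of them go to the pairs inside the frequent word, so "token" becomes a single vocabulary entry and "tokamak" never does. Real tokenizers typically work on bytes and use far larger merge budgets, but the frequency-driven behavior is the same in spirit.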