The biggest question is whether you allow re-tokenization and, if so, whether the new tokenizer should be built from the same data used for the training itself. Right now the existing tokens have knowledge about the language built in, and changing them is against the rules, unfavorable, or both.
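To make the "knowledge built into the tokens" point concrete, here is a minimal sketch of a toy byte-pair-encoding trainer. It is only an illustration under my own assumptions: the names `train_bpe`, `apply_merge`, and `tokenize` and both tiny corpora are hypothetical, and a real tokenizer is far more involved. It shows that the merges learned from one corpus segment a word differently from merges learned on another, so swapping the tokenizer changes what the model sees.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version, hypothetical helper)."""
    # Every word starts out as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Two made-up corpora standing in for "the original data" and "different data".
merges_a = train_bpe(["the", "the", "then"], num_merges=2)
merges_b = train_bpe(["cat", "cat", "cart"], num_merges=2)

tokens_a = tokenize("the", merges_a)  # ['the']
tokens_b = tokenize("the", merges_b)  # ['t', 'h', 'e']
print(tokens_a, tokens_b)
```

The same word comes out as one token under the first tokenizer and three under the second, which is the sense in which the existing tokens already carry information about the language.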