Yep, the use of regex is both a huge dependency and a huge bottleneck in the tokenizer. I think it's a beautiful project to try to do this correctly, but I'd need someone who is really familiar with regex to pitch in, and also a large test suite to make sure nothing breaks. I'd love to merge such