Decoding (tokens -> string) is just a lookup table and string concatenation. Encoding (string -> tokens) is a pain. For sentencepiece, I *think* llama2.c has a simple implementation that probably works, but I'm not 100% sure: https://github.com/karpathy/llama2.c/blob/master/run.c#L452
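To make the asymmetry concrete, here is a minimal sketch in Python: decoding really is a lookup plus concat, while encoding needs a merge loop. The vocab, scores, and the greedy highest-score pair-merging are toy assumptions for illustration (similar in spirit to what llama2.c's encoder does, but not a faithful port of it or of sentencepiece).

```python
# Hypothetical toy vocab and merge scores, purely for illustration.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6, "hello": 7}
scores = {"h": 0.0, "e": 0.0, "l": 0.0, "o": 0.0,
          "he": 1.0, "ll": 1.5, "hell": 2.0, "hello": 3.0}
id_to_str = {i: s for s, i in vocab.items()}

def decode(ids):
    # the easy direction: table lookup + string concat
    return "".join(id_to_str[i] for i in ids)

def encode(text):
    # the hard direction: start from per-character tokens, then repeatedly
    # merge the adjacent pair whose concatenation is in the vocab with the
    # best score, until no more merges apply
    toks = list(text)
    while True:
        best, best_i = None, -1
        for i in range(len(toks) - 1):
            merged = toks[i] + toks[i + 1]
            if merged in vocab and (best is None or scores[merged] > scores[best]):
                best, best_i = merged, i
        if best is None:
            break
        toks = toks[:best_i] + [best] + toks[best_i + 2:]
    return [vocab[t] for t in toks]

# "hello" merges ll -> he -> hell -> hello, ending as a single token
ids = encode("hello")   # [7]
text = decode(ids)      # "hello"
```

Even in this toy form, decode is one line while encode is a quadratic search loop, which is the asymmetry the post is about.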
… For tiktoken-style, the problem is the
Token Encoding and Decoding: Asymmetric Complexity in LLMs