yeah this is very related idea — dense vectors convey more information than discrete tokens, in less space! but it's not the same thing, i'm describing how to train a model that generates gist tokens on the fly for its intermediate reasoning steps
Training Models to Generate Gist Tokens for Intermediate Reasoning
By
–