Trying to think about what a Moore's Law for Language Models would look like. The problem I'm running into is that there are two separate axes to improve on: model ability for a given model size (perplexity) and inference speed (tokens per second). Anyway, we can all agree
Moore’s Law for Language Models: Performance and Speed Trade-offs