I agree with the core of what I think you're saying: there are lots of architectures that could allow models to scale with respect to both data and computation. Transformers are one, maybe not even a good one. But LSTMs as we have them don't scale, so we were missing an invention.
Transformers and LSTMs: Scaling architectures and the need for innovation