Modern transformers are well-behaved according to well-behaved scaling laws (which are a function of (num. of tokens, num. of parameters).
This allows one to find the hyperparameters at a smaller scale, and then keep scaling up parameters and data according to some power law.
Scaling Laws Enable Efficient Transformer Hyperparameter Optimization
By
–