Most scaling laws assume you train once and answer once. This paper argues that if you already know you'll spend extra compute at test time by sampling many answers, you should train a different model: instead of a bigger model trained the usual way, it can be better to train a smaller model for much longer. Because smaller models are cheaper to sample from, many cheap tries can beat one expensive shot from a larger model. The quantity to optimize is therefore not training compute alone, but training + inference compute together, and under that accounting overtraining can actually become the compute-optimal choice.
→ View original post on X — @askalphaxiv, 2026-04-06 18:08 UTC
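
To make the accounting concrete, here is a rough back-of-envelope sketch of the training + inference tradeoff. It uses the common approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token for inference; the model sizes, token counts, and sample counts below are made up for illustration and are not numbers from the paper.

```python
# Illustrative compute accounting for "big model, one sample" vs.
# "small overtrained model, many samples". All numbers are hypothetical.

def total_flops(n_params, train_tokens, n_queries, samples_per_query, tokens_per_sample):
    """Total FLOPs = training (~6*N*D) + inference (~2*N per generated token)."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * n_queries * samples_per_query * tokens_per_sample
    return train + inference

# Hypothetical scenario: a 70B model sampled once per query vs. a 7B model
# trained on 4x as many tokens and sampled 16 times per query.
big = total_flops(70e9, 1.4e12, n_queries=1e9, samples_per_query=1, tokens_per_sample=512)
small = total_flops(7e9, 5.6e12, n_queries=1e9, samples_per_query=16, tokens_per_sample=512)

print(f"big model, 1 sample:     {big:.2e} FLOPs")
print(f"small model, 16 samples: {small:.2e} FLOPs")
```

With these made-up numbers the small, overtrained model still comes out cheaper in total FLOPs despite sampling 16 times per query, which is the kind of tradeoff the post describes: once inference is part of the budget, the compute-optimal training recipe shifts.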
