"Reasoning over Mathematical Objects" Most reasoning benchmarks still let models answer with multiple choice or short numerics, which makes evaluation easy but also makes the task easier than real STEM reasoning. This paper shows that when you remove the options and ask for the actual object, such as an equation, matrix, set, interval, or piecewise function, performance drops sharply, even for frontier models. The paper proposes Principia: a benchmark, training set, and verifier pipeline built specifically for mathematical-object reasoning, plus on-policy judge training to score these hard outputs reliably. What makes this interesting is that training on these harder outputs also improves standard math and science benchmarks, suggesting the gain is not just better formatting but better reasoning.
View original post on X: @askalphaxiv, 2026-04-04 17:55 UTC
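
To see why scoring object-level answers needs a verifier at all, here is a minimal sketch of symbolic equivalence checking with sympy. This is an illustration of the general difficulty, not the paper's actual pipeline: an object answer can be algebraically correct in many surface forms, so exact string matching (which suffices for multiple choice) no longer works. The function name and examples are hypothetical.

```python
# Hypothetical illustration: grading a free-form expression answer by
# symbolic equivalence rather than string match or a multiple-choice letter.
import sympy as sp

def equivalent(model_answer: str, reference: str) -> bool:
    """Return True if the two expressions are symbolically equal."""
    try:
        lhs = sp.sympify(model_answer)
        rhs = sp.sympify(reference)
        # Two expressions are equivalent if their difference simplifies to zero.
        return sp.simplify(lhs - rhs) == 0
    except (sp.SympifyError, TypeError):
        # Unparseable output counts as incorrect.
        return False

print(equivalent("(x + 1)**2", "x**2 + 2*x + 1"))  # True: same object, different form
print(equivalent("(x + 1)**2", "x**2 + 1"))        # False
```

Even this toy check only covers expressions; matrices, sets, intervals, and piecewise functions each need their own equivalence notion, which is why the post highlights a dedicated verifier pipeline and trained judges rather than rule-based matching alone.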



