Great piece on RL! One thing I have noticed with RULER is that you don't need o3 or any other big model as the judge for every run. Qwen3 32B works well for several tasks and costs a fraction as much. You can always start cheap, validate that the score separation looks right, then scale up the
Cost-Effective RL Evaluation: Qwen3 32B Alternative to o3
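One rough way to check that score separation "looks right" before paying for a larger judge is sketched below. It assumes the judge's scores have already been collected as plain lists of floats in [0, 1], one list per group of trajectories for the same prompt. The function names, the input format, and both thresholds are hypothetical illustrations, not part of RULER or any library.

```python
from statistics import mean, pstdev


def score_separation(groups, min_spread=0.2):
    """Summarize how well a judge separates trajectories within each group.

    groups: list of non-empty lists of judge scores in [0, 1], one inner
    list per group of trajectories (hypothetical input format).
    min_spread: assumed cutoff below which a group counts as "flat".
    """
    if not groups or any(len(g) == 0 for g in groups):
        raise ValueError("need at least one non-empty group of scores")
    # Spread: gap between the best and worst score inside each group.
    spreads = [max(g) - min(g) for g in groups]
    # Population standard deviation of the scores inside each group.
    stdevs = [pstdev(g) for g in groups]
    # A flat group is one where the judge barely distinguishes trajectories.
    flat = sum(1 for s in spreads if s < min_spread)
    return {
        "mean_spread": mean(spreads),
        "mean_stdev": mean(stdevs),
        "flat_fraction": flat / len(groups),
    }


def judge_looks_usable(groups, min_spread=0.2, max_flat_fraction=0.25):
    """Return True if few enough groups are flat (thresholds are assumptions)."""
    report = score_separation(groups, min_spread)
    return report["flat_fraction"] <= max_flat_fraction


if __name__ == "__main__":
    # Made-up scores purely for illustration, not real judge output.
    cheap_judge_scores = [[0.1, 0.5, 0.9], [0.2, 0.3, 0.8], [0.5, 0.5, 0.5]]
    print(score_separation(cheap_judge_scores))
    print(judge_looks_usable(cheap_judge_scores))
```

If most groups come back flat with the cheap judge, that would be a signal to try a stronger one; if the spread is healthy, the cheaper judge may be enough for that task.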