What if we apply RL and reasoning to PRMs?
StepWiser: Stepwise Generative Judges for Wiser Reasoning
StepWiser trains a CoT-based judge that checks whether each “chunk-of-thought” helps solve the problem, producing better step labels and stronger search than discriminative PRMs.
