“Self-Distilled RLVR”
alphaXiv (@askalphaxiv), April 8, 2026
Most reasoning RL rewards are reliable but too sparse.
Self-distillation (SD) can fix that with dense token-level signals, but if the teacher sees hidden information, the model can start learning shortcuts it will never have at test time.
So this paper, RLSD, lets RL decide whether an answer was good or bad and lets self-distillation decide which tokens deserve more credit.
Instead of using a teacher to tell the model what to imitate, it uses the teacher for token-level credit assignment, which gives denser learning than vanilla RLVR without the instability and leakage of naive self-distillation.
Empirically, RLSD stays stable while on-policy SD degrades, and it beats GRPO-style baselines on multimodal reasoning.
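The split described above (RL sets the direction, the teacher sets the per-token share) can be illustrated with a toy sketch. The post gives no formulas, so everything here is an assumption rather than the paper's method: the function names, the softmax weighting over teacher-minus-student log-probability gaps, the `temperature` parameter, and the sign flip for bad answers are all hypothetical. Only the group-standardized reward is the usual GRPO-style sequence advantage.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantage: reward standardized within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_credit_weights(student_logps, teacher_logps, sign, temperature=1.0):
    """Hypothetical weighting (an assumption, not the paper's formula).

    Softmax over sign * (teacher - student) log-prob gaps, rescaled so the
    weights are positive and average to 1 over the sequence. With sign=+1,
    tokens the teacher endorses more get more credit; with sign=-1, tokens
    the teacher finds less likely get more blame.
    """
    gaps = [sign * (t - s) / temperature
            for s, t in zip(student_logps, teacher_logps)]
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    n = len(gaps)
    return [n * e / total for e in exps]


def token_advantages(rewards, student_logps, teacher_logps):
    """Per-token advantages: sign and scale from RL, the split from the teacher."""
    result = []
    for adv, s, t in zip(group_advantages(rewards), student_logps, teacher_logps):
        sign = 1.0 if adv >= 0 else -1.0
        weights = token_credit_weights(s, t, sign)
        # Weights are positive, so each token keeps the sign of the RL advantage.
        result.append([adv * w for w in weights])
    return result


# Two sampled answers: the first verified correct, the second incorrect.
rewards = [1.0, 0.0]
student = [[-1.0, -2.0, -0.5], [-1.0, -1.0, -1.0]]  # student token log-probs
teacher = [[-1.0, -0.5, -0.5], [-1.0, -3.0, -1.0]]  # teacher token log-probs
for row in token_advantages(rewards, student, teacher):
    print([round(a, 3) for a in row])
```

Because the weights are strictly positive and average to 1, the teacher can only redistribute credit within a sequence. It cannot flip the verdict given by the verifiable reward, which is one plausible way to read the post's claim about avoiding leakage from a teacher that sees hidden information.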