A model trained with RLVR can fail to answer questions that its base model previously could. The paper "The Invisible Leash: Why RLVR May Not Escape Its Origin" shows that RLVR mostly boosts the base model's favorite answers, which raises pass@1, and does so at the cost of diversity.
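To make the pass@1 versus diversity trade-off concrete, here is a minimal toy sketch. It is not taken from the paper: the two problems, their per-sample success probabilities, and the helper names `pass_at_k` and `mean_pass_at_k` are all hypothetical, chosen only to illustrate how sharpening a model toward its preferred answers can raise pass@1 while lowering pass@k for large k. It assumes samples are drawn independently.

```python
# Toy illustration with made-up numbers; not data from the paper.

def pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

# Hypothetical per-problem probability of sampling a correct answer.
# The base model solves problem_b rarely; the RLVR model never does,
# because its probability mass has concentrated on other answers.
base = {"problem_a": 0.30, "problem_b": 0.05}
rlvr = {"problem_a": 0.90, "problem_b": 0.00}

def mean_pass_at_k(model: dict, k: int) -> float:
    """Average pass@k over all problems for one model."""
    return sum(pass_at_k(p, k) for p in model.values()) / len(model)

if __name__ == "__main__":
    for k in (1, 64):
        print(f"k={k}: base={mean_pass_at_k(base, k):.3f} "
              f"rlvr={mean_pass_at_k(rlvr, k):.3f}")
```

In this toy setup the RLVR model wins at k=1, but the base model wins at k=64, because a problem whose success probability has dropped to zero can never be recovered by sampling more.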
RLVR Models Sacrifice Diversity for Base Model Preferences
