“Reinforcement Learning with Calibration Rewards” What makes top reasoning models overconfident? MIT researchers found that the reinforcement learning (RL) used to train these models rewards correct answers but gives no signal about how confident the model should be. Training models to also estimate their confidence improved calibration while maintaining accuracy:
MIT Improves Reasoning Model Calibration Through Reinforcement Learning
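The summary above does not spell out the reward that was used. As a minimal sketch of how a reward could pay for calibrated confidence as well as correctness, the Python below subtracts a Brier-style squared-error penalty from a binary correctness reward. The function names, the penalty form, and the equal weighting are assumptions made for illustration, not details taken from the MIT work.

```python
def calibration_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier-style penalty on the stated confidence.

    Hypothetical reward shape, assumed for illustration only.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if is_correct else 0.0
    # A confident wrong answer is penalized most; a confident right answer least.
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibration_reward(True, confidence)
            + (1.0 - p_correct) * calibration_reward(False, confidence))


if __name__ == "__main__":
    # Search a grid of stated confidences for a model that is right 70% of the time.
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda q: expected_reward(0.7, q))
    print(f"best stated confidence: {best:.2f}")
```

Under this assumed reward, the expected value for a fixed accuracy p is maximized by stating a confidence equal to p, so a policy gains nothing by overclaiming. A plain correct-or-incorrect reward has no such term, which matches the overconfidence the summary describes.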
