How do top reasoning models become overconfident? MIT researchers found that RL rewards correct answers without considering how sure the model is. By training models to estimate their confidence in each answer, the team improved uncertainty estimates without hurting accuracy:
MIT Improves Reasoning Model Confidence Calibration Through RL Training
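The core idea described above — rewarding correctness while also scoring the model's stated confidence — could be sketched as a calibration-aware reward. This is a hypothetical illustration, not MIT's actual reward function: it assumes the model emits a confidence value in [0, 1] alongside its answer, and uses a Brier-style penalty (one common calibration measure) as the confidence term.

```python
def calibrated_reward(correct: bool, confidence: float, calib_weight: float = 1.0) -> float:
    """Reward correctness, minus a Brier-style penalty for miscalibration.

    Hypothetical sketch: `calib_weight` balances the accuracy reward
    against the calibration penalty and is not from the article.
    """
    outcome = 1.0 if correct else 0.0
    # Penalty is 0 when stated confidence matches the outcome exactly,
    # and largest when the model is confidently wrong.
    brier_penalty = (confidence - outcome) ** 2
    return outcome - calib_weight * brier_penalty

# A confidently correct answer scores near the maximum...
print(calibrated_reward(True, 0.9))   # ≈ 0.99
# ...while a confidently wrong one is penalized below zero.
print(calibrated_reward(False, 0.9))  # ≈ -0.81
```

Under a reward like this, the model can no longer improve its score by always claiming high confidence: overconfidence on wrong answers is directly penalized, while accuracy is still rewarded.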