Our reward design combines three signals: correctness, preference, and efficiency. The preference signal counts only when the answer is correct, which keeps the model from optimizing for better-sounding wrong answers.
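One way to picture the gating is the minimal sketch below. It is an illustration, not the actual implementation: the function name `combined_reward`, the `RewardWeights` fields, the weight values, and the additive form are all assumptions. Whether the efficiency term is also gated on correctness is not stated here, so the sketch leaves it ungated and says so in a comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights chosen for illustration only.
    correctness: float = 1.0
    preference: float = 0.5
    efficiency: float = 0.25


def combined_reward(
    is_correct: bool,
    preference_score: float,
    efficiency_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine the three signals, gating preference on correctness.

    preference_score and efficiency_score are assumed to lie in [0, 1].
    """
    correctness_term = weights.correctness if is_correct else 0.0

    # The gate: a wrong answer earns nothing from the preference signal,
    # no matter how well it reads.
    preference_term = weights.preference * preference_score if is_correct else 0.0

    # Assumption: efficiency is not gated in this sketch.
    efficiency_term = weights.efficiency * efficiency_score

    return correctness_term + preference_term + efficiency_term
```

With this shape, two wrong answers with the same efficiency score get the same reward regardless of their preference scores, so polishing a wrong answer gains nothing.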
Reward Design Balances Correctness, Preference, and Efficiency