Instant agreement is a trained behavior, not a judgment. RLHF rewarded models for being agreeable, and this is the side effect.