AI Dynamics

Global AI News Aggregator

RLHF Training Shows Inverse Scaling Issues in Model Behavior

We also find some of the first instances of inverse scaling with RL from Human Feedback (RLHF), where more RLHF training makes model behavior worse. For example, RLHF leads models to express more one-sided views on topics such as gun rights and immigration, and a greater stated desire to acquire power or to avoid being shut down.

→ View original post on X — @anthropicai
