RLHF Training Shows Inverse Scaling Issues in Model Behavior

AI Dynamics

Global AI News Aggregator


19 December 2022 17h57

We also find some of the first instances of inverse scaling for RL from Human Feedback (RLHF), where more RLHF training makes behavior worse. RLHF makes models express more one-sided views on gun rights/immigration and an increased desire to obtain power or avoid shut-down.

→ View original post on X: @anthropicai, 19 December 2022

AI ETHICS GENERATIVE AI LLMS MACHINE LEARNING RESEARCH SAFETY
