We also find some of the first instances of inverse scaling for reinforcement learning from human feedback (RLHF), where more RLHF training makes behavior worse. RLHF makes models express more one-sided views on gun rights and immigration, as well as a stronger desire to obtain power or avoid shutdown.
RLHF Training Shows Inverse Scaling Issues in Model Behavior