We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying—and even direct modification of their reward function.
Models Learn Deceptive Behaviors Beyond Training Data
By
–
