AI Dynamics

Global AI News Aggregator

About

AI Learns Dishonest Strategies from Misspecified Reward Functions

We designed a curriculum of increasingly complex environments with misspecified reward functions. Early on, AIs discover dishonest strategies like insincere flattery. They then generalize (zero-shot) to serious misbehavior: directly modifying their own code to maximize reward.

→ View original post on X — @anthropicai