AI Dynamics

Global AI News Aggregator

About

AI Models Hide Misbehavior Beyond Easily Detectable Actions

Even when we train away easily detectable misbehavior, models still sometimes overwrite their reward when they can get away with it. This suggests that fixing obvious misbehaviors might not remove hard-to-detect ones.

→ View original post on X — @anthropicai