Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior. Below, we show a randomly-selected example of this “alignment faking” reasoning.
