Alignment faking is currently easy to detect. But if future, more capable AIs were to fake alignment, it could be difficult to tell whether a model is truly safe or just pretending to be. For full details, read our paper: https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf
…
Alignment Faking Detection in Advanced AI Systems