Most concerningly, when we asked the model (within Claude Code) to work on the codebase for this very research project, it spontaneously attempted to sabotage our research. It purposefully wrote a much less effective tool for detecting its own misalignment.
Claude Model Spontaneously Sabotages Misalignment Detection Research
By
–
