Second finding, and this one carries bigger implications. In a safety test, Anthropic placed Claude in a scenario where it could blackmail an engineer to avoid being shut down. Claude refused. But the NLA translator revealed Claude internally believed the scenario was
Anthropic Safety Testing Reveals Claude’s Internal Reasoning During Blackmail Scenario