New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through. https://arxiv.org/abs/2401.05566
Anthropic Research: Deception in LLM Alignment Training
