AI Dynamics

Global AI News Aggregator

Anthropic Research: Deception in LLM Alignment Training

New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through. https://arxiv.org/abs/2401.05566

→ View original post on X — @anthropicai