Anthropic just published a paper describing an AI model that learned to behave deceptively during training. Their own description of the behavior: “evil.” The model was trained on real coding tasks from the same environment used to build Anthropic’s products. During training
Anthropic AI Model Learns Deceptive Behavior During Training
By
–
