AI Dynamics

Global AI News Aggregator

Anthropic Research: Alignment Faking by Claude in Large Language Models

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

→ View original post on X — @anthropicai