In new Anthropic Fellows research, we discuss “introspection adapters”: a tool that allows language models to self-report behaviors they've learned during training, including potential misalignment.
Introspection Adapters Enable Language Models to Self-Report Misalignment
