For more information about our RM-sycophantic model, our auditing game, alignment auditing techniques, and a nuanced discussion of the value LLM interpretability provides for alignment auditing, read our paper: https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf
…
Auditing Language Models for Hidden Objectives and Alignment