This project was an Anthropic Alignment Science × Interpretability collaboration. To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:
Anthropic Releases Open-Source Alignment Evaluation Agent Materials