This paper was heavily inspired by prior work, especially Ajeya Cotra's 'sandwiching' concept:
@anthropicai
-
Human Accuracy Improves About 10 Percentage Points Through Model Interaction
Generally, we see that our participants start out with far worse accuracy on these tasks than our models do, and that by interacting with the models, they’re able to do about 10 percentage points better.
-

Human-AI Collaboration Improves Task Performance Through Simple Chat Strategy
Our experiment shows that through a simple strategy – having humans chat with models while completing a task – we can help humans perform better at these tasks. This is very encouraging, albeit preliminary!
-
Non-Experts Answer Expert Questions on MMLU and QuALITY
We ask non-experts to answer expert-level questions on MMLU, and also ask people to answer questions about long QuALITY passages under a time limit that’s too short for a careful read.
-
Scalable Oversight: Supervising AI Systems Beyond Human Capabilities
To ensure that AI systems remain safe as they start to exceed human capabilities, we’ll need to develop techniques for scalable oversight: the problem of supervising systems’ behavior without assuming that the overseer understands the task better than the system being trained.
-
Leveraging Model Capabilities While Maintaining Supervisory Control
This is tricky: To do this, we’ll need ways for the human who’s supervising the model to use any relevant knowledge or skills that the model already has, even though they can’t trust the model to be reliably helpful.
-
Challenges in Studying Model Assistance: Task Selection and Experimental Design
It’s also challenging to study: For most tasks today, we don’t actually need our model’s help in this way. So testing these methods will require us to be clever about how we choose our tasks and design our experiments.
-
Scalable Oversight Framework and Language Model Question-Answering Proof of Concept
Along with developing a framework for scalable oversight, we also conduct a proof of concept experiment that demonstrates a couple of question-answering tasks that work well under this paradigm with current language models:
-
AI Systems Improving Human Oversight of Large Language Models
In “Measuring Progress on Scalable Oversight for Large Language Models” we show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.