@anthropicai

–

08 November 2022 17h33

This paper was heavily inspired by prior work, especially Ajeya Cotra's 'sandwiching' concept:

Human Accuracy Improves 10 Percentage Points Through Model Interaction

Generally, we see that our participants start out with far worse accuracy on these tasks than our models do, and that by interacting with the models, they’re able to do about 10 percentage points better.

Human-AI Collaboration Improves Task Performance Through Simple Chat Strategy

Our experiment shows that through a simple strategy – having humans chat with models while completing a task – we can help humans perform better at these tasks. This is very encouraging, albeit preliminary!

Non-Experts Answer Expert Questions on MMLU and QuALITY

We ask non-experts to answer expert-level questions on MMLU, and also ask people to answer questions about long QuALITY passages under a time limit that’s too short for a careful read.

Scalable Oversight: Supervising AI Systems Beyond Human Capabilities

To ensure that AI systems remain safe as they start to exceed human capabilities, we’ll need to develop techniques for scalable oversight: the problem of supervising systems’ behavior without assuming that the overseer understands the task better than the system being trained.

Leveraging Model Capabilities While Maintaining Supervisory Control

This is tricky: the human supervising the model needs ways to draw on any relevant knowledge or skills that the model already has, even though they can’t trust the model to be reliably helpful.

Challenges in Studying Model Assistance: Task Selection and Experimental Design

It’s also challenging to study: For most tasks today, we don’t actually need our model’s help in this way. So testing these methods will require us to be clever about how we choose our tasks and design our experiments.

Scalable Oversight Framework and Language Model Question-Answering Proof of Concept

Along with developing a framework for scalable oversight, we also conduct a proof of concept experiment that demonstrates a couple of question-answering tasks that work well under this paradigm with current language models:

AI Systems Improving Human Oversight of Large Language Models

In “Measuring Progress on Scalable Oversight for Large Language Models,” we show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.