AI Dynamics

Global AI News Aggregator

Claude 3.5 Sonnet Achieves 21% on PaperBench Research Task

We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that

→ View original post on X — @openai,

Commentaires

Leave a Reply

Your email address will not be published. Required fields are marked *