The number of AI influencers who are surprised that Sonnet 4.5 didn't achieve a better position on the METR benchmark, when they were saying it "could work autonomously for 30 hours," worries me. They're not understanding anything about what these benchmarks measure. They're
AI Influencers Misunderstand METR Benchmark Results for Sonnet
By
–
