It is why the gold medals at the various math and coding Olympiads were a big deal: they were unsaturated benchmarks, absent from the training data, with clear human comparisons. We are now down to the various measures of task length (METR), HLE, FrontierMath, vending machine operation…
AI Benchmarks Beyond Training Data: Olympiads and New Metrics