(Since I am on a benchmark theme today) The ARC team does a good job of keeping AI labs honest about their benchmarks, including showing that Qwen's big ARC-AGI performance doesn't replicate. But ARC-AGI also has a strong philosophy of what AI should do. We need other benchmarking efforts.