This is a laudable goal, but it may mean designing, testing, and deploying AI systems differently. For example, why not measure the ability of GPT-5.5 to solve problems *with* a person rather than on its own? (see Stanford's centaur
Why Measure AI Performance With Humans Rather Than Alone
By
–