Most AI companies test their models on generic benchmarks.
God of Prompt (@godofprompt), December 15, 2025
"Can it code?" "Can it do math?" "Can it write an essay?"
PolyAI built their own test: did the customer's issue get resolved?
That's it. That's the whole benchmark.
And it's even trusted with suicide hotlines and… pic.twitter.com/wUR9Pk1DqR