I used the GPQA score for Qwen, but I'm not sure whether it comes from the standard set or the Diamond set.
If Qwen's score is on the standard set and not Diamond, it's even more impressive that OpenAI models evaluated on the hardest set beat Qwen on the standard one!
Comparison of Qwen and OpenAI Model Performance on GPQA Benchmarks