I believe we're running 4 repeats and taking the average, and also using the same zero-shot CoT prompt as before, see https://github.com/openai/simple-evals
… But I totally agree with you that GPQA is high variance. I wish there were like 3k examples instead of 300.
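To put rough numbers on the 300 versus 3k point, here is a minimal sketch rather than anything taken from the quoted message or from simple-evals. It assumes each question is an independent pass/fail trial and uses a hypothetical true accuracy of 50%, which is the worst case for variance. The function name `accuracy_standard_error` is made up for illustration.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical true accuracy (an assumption, not a reported score).
p = 0.5

for n in (300, 3000):
    se = accuracy_standard_error(p, n)
    # 1.96 is the usual normal-approximation multiplier for a 95% interval.
    print(f"n={n}: standard error ~ {100 * se:.1f} points, "
          f"95% interval ~ +/- {100 * 1.96 * se:.1f} points")
```

Under these assumptions, going from 300 to 3,000 questions shrinks the standard error by a factor of about 3.2 (the square root of 10). Averaging 4 repeats on the same questions should smooth out run-to-run sampling noise, but it does not add questions, so the uncertainty from the small question set remains.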
GPQA High Variance Issues and Evaluation Methodology Discussion