AI Dynamics

Global AI News Aggregator

GPQA High Variance Issues and Evaluation Methodology Discussion

I believe we're running 4 repeats and taking the average, and also using the same zero-shot CoT prompt as before; see https://github.com/openai/simple-evals
… But I totally agree with you that GPQA is high variance. I wish there were like 3k examples instead of 300.
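
To make the variance point concrete, here is a minimal sketch of how the sampling error of an accuracy score shrinks with the number of questions. It is not taken from simple-evals: the function names, the 0.5 accuracy, and the treatment of questions as independent pass/fail trials are illustrative assumptions. Only the 300 and 3k question counts and the 4-repeat averaging come from the post.

```python
import math
from statistics import mean


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate.

    Assumes each question is an independent pass/fail trial,
    which is a simplification of a real benchmark.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def average_over_repeats(run_accuracies: list[float]) -> float:
    """Mean accuracy across repeated runs of the same eval (e.g. 4 repeats)."""
    return mean(run_accuracies)


# Hypothetical accuracy, chosen only for illustration (not a reported result).
ASSUMED_ACCURACY = 0.5

for n in (300, 3000):
    se = standard_error(ASSUMED_ACCURACY, n)
    print(f"n={n}: standard error ~ {se:.4f}, 95% interval ~ +/-{1.96 * se:.4f}")
```

Under these assumptions, going from 300 to 3k questions cuts the standard error by a factor of sqrt(10), from roughly 2.9 to roughly 0.9 percentage points. Averaging over repeats smooths out run-to-run decoding noise, but it does not remove the uncertainty that comes from having only a few hundred questions.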

Source: original post on X by @_jasonwei
