AI Dynamics

Global AI News Aggregator

Three Major MMLU Evaluation Codebases Compared

5/ We thus have (at least) 3 serious codebases for evaluating on the same MMLU dataset:
– "Original implementation" from the MMLU benchmark authors
– "HELM implementation" from Stanford
– "Harness implementation" from EleutherAI (recently updated – see the end of the thread)

→ View original post on X, by @thom_wolf
