5/ We thus have (at least) three serious codebases for evaluating models on the same MMLU dataset:
– "Original implementation" from the MMLU benchmark authors
– "HELM implementation" from Stanford
– "Harness implementation" from EleutherAI (recently updated – see the end of the thread)
Three Major MMLU Evaluation Codebases Compared