23/ …not at all comparable, even if they're both called MMLU and evaluated on the same dataset. Takeaway? Evaluations are strongly tied to their implementations, down to minute details. Reporting just an "MMLU score" gives almost no information about whether two such numbers can be compared.
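To make the takeaway concrete, here is a minimal sketch of how one multiple-choice question can be scored three different ways. Everything in it is an assumption for illustration: the log-probabilities are invented toy numbers, and the three scoring rules (compare the answer-letter log-probs, sum the log-probs of the full answer text, or average them per token) are common conventions in general, not a description of any specific harness.

```python
from typing import Dict, List

# Toy numbers, invented for illustration only (not from any real model).
# Log-probability the model assigns to each answer *letter* after the prompt.
letter_logprobs: Dict[str, float] = {"A": -1.2, "B": -0.9, "C": -2.5, "D": -3.0}

# Per-token log-probabilities of each full answer *text* after the prompt.
answer_token_logprobs: Dict[str, List[float]] = {
    "A": [-0.7],
    "B": [-0.5, -0.6],
    "C": [-0.2, -0.3, -0.2, -0.25],
    "D": [-1.0, -1.5],
}


def pick_by_letter(letter_lp: Dict[str, float]) -> str:
    """Choose the option whose letter token is most likely."""
    return max(letter_lp, key=lambda k: letter_lp[k])


def pick_by_sum(token_lp: Dict[str, List[float]]) -> str:
    """Choose the option whose full answer text has the highest total log-prob."""
    return max(token_lp, key=lambda k: sum(token_lp[k]))


def pick_by_mean(token_lp: Dict[str, List[float]]) -> str:
    """Choose the option with the highest per-token (length-normalized) log-prob."""
    return max(token_lp, key=lambda k: sum(token_lp[k]) / len(token_lp[k]))


predictions = {
    "letter": pick_by_letter(letter_logprobs),
    "sum": pick_by_sum(answer_token_logprobs),
    "mean": pick_by_mean(answer_token_logprobs),
}

gold = "A"  # hypothetical correct answer
for name, pred in predictions.items():
    print(f"{name}: predicted {pred}, correct={pred == gold}")
```

With these toy numbers, the same model output on the same question yields a different predicted answer under each rule, so the "accuracy" depends on which rule the implementation happened to use.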
MMLU Scores Incomparable: Evaluation Implementation Details Matter