Nice! Btw it's possible (in principle) to also evaluate MMLU the same way I evaluate HellaSwag: you append each of the 4 answer options to the question in turn, as a continuation, and predict the one with the highest average per-token log probability. It hurts the model by a few percent, though, because it never sees all the options side by side and so can't reason by elimination.
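To make the scoring rule concrete, here is a minimal sketch of just the selection step. It assumes you already have the per-token log probabilities a model assigned to each option's tokens; how those are obtained is left out. The names `average_logprob` and `pick_option` are hypothetical, and the numbers are made up for illustration, not real model output.

```python
from typing import Sequence


def average_logprob(token_logprobs: Sequence[float]) -> float:
    """Mean log probability over the tokens of one continuation."""
    if not token_logprobs:
        raise ValueError("a continuation needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pick_option(option_token_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the option with the highest average log prob."""
    scores = [average_logprob(lp) for lp in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative (made-up) per-token log probs for options A, B, C, D.
# Each inner list covers only the option's own tokens, not the question.
example = [
    [-2.0, -1.5, -3.0],
    [-0.5, -0.7],
    [-1.0, -1.2, -0.9, -1.1],
    [-4.0],
]

predicted = pick_option(example)
print("ABCD"[predicted])
```

Averaging rather than summing keeps a longer option from being penalized just for having more tokens. Each option is scored in isolation, which is why the model gets no chance to compare the candidates against each other.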
MMLU Evaluation Method: Log Probability vs Multiple Choice