My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years but that's long over.
SWE-Bench Verified (real, practical, verified problems) I really like and is great but itself too narrow.
Evaluation Crisis: MMLU Obsolete, Need Better AI Metrics