Another sign that AI benchmarking has grown too narrow: needle-in-a-haystack retrieval, instruction following, hallucination rates, and similar measures all matter, and measuring only what correlates with GPQA, MMLU, and the like may blind users to other models' strengths and weaknesses.
AI Benchmarking Limitations: Beyond GPQA and MMLU Metrics