We don't know how to measure LLM abilities well. Most tests are collections of multiple-choice questions, tasks, or trivia. They don't represent real-world use well, they are subject to gaming, and their results are affected by prompt design in unknown ways. The rest rely on human preference.
Measuring LLM Abilities: Current Limitations and Testing Challenges
