Long-form eval breaks on novelty. A rubric written before the research exists can't score research that shifts the criteria. 2,500 rubrics is a real dataset, but the measurement question is whether the set handles reports that break the rubric assumptions. That's where deep
Evaluation Rubrics Fail on Novel AI Research Paradigms
By
–