We evaluate replication attempts using detailed rubrics co-developed with the original authors of each paper. These rubrics systematically break down the 20 papers into 8,316 precisely defined requirements that are evaluated by an LLM judge.
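As a rough illustration of how a rubric of discrete requirements could be checked one by one by a judge, here is a minimal sketch. Everything in it is hypothetical: the `Requirement` fields, the `score_submission` helper, and the unweighted pass fraction are assumptions made for illustration, since the text above does not say how individual judgments are combined. The `keyword_judge` stub only stands in for an LLM judge so the example runs without any model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Requirement:
    """One rubric item (hypothetical structure, not the authors' schema)."""
    paper_id: str
    description: str


# A judge takes a requirement and a submission and returns pass/fail.
# In practice this would wrap an LLM call; here it is just a callable.
Judge = Callable[[Requirement, str], bool]


def score_submission(
    requirements: List[Requirement], submission: str, judge: Judge
) -> float:
    """Return the fraction of requirements the judge marks as satisfied.

    Assumption: a plain unweighted average. The real aggregation may differ.
    """
    if not requirements:
        return 0.0
    passed = sum(1 for req in requirements if judge(req, submission))
    return passed / len(requirements)


def keyword_judge(req: Requirement, submission: str) -> bool:
    """Stand-in judge: passes if the requirement text appears in the submission."""
    return req.description.lower() in submission.lower()


rubric = [
    Requirement("paper-1", "trains the model"),
    Requirement("paper-1", "reports accuracy"),
]
submission_text = "This script trains the model on the provided dataset."
score = score_submission(rubric, submission_text, keyword_judge)
```

Replacing `keyword_judge` with a function that prompts a language model and parses its verdict would keep the same interface, which is the main reason the judge is passed in as a callable here.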
Evaluating AI Paper Replication with Detailed LLM-Based Rubrics