How to Properly Do LLM-as-a-Judge

Raw LLM-as-a-Judge scores are inherently biased because LLM judges often make mistakes. This paper proposes a simple statistical method that corrects the scores and computes valid confidence intervals using a human-verified calibration set.
Correcting LLM-as-a-Judge Bias: Statistical Calibration Method
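The summary above does not spell out the correction itself. As a rough sketch of what such a calibration can look like, the Python below uses the Rogan-Gladen estimator, a standard correction for an imperfect binary classifier, together with a delta-method (Wald-style) interval. Both are assumptions chosen for illustration and may differ from the paper's exact procedure. The function name `calibrated_rate`, its parameters, and the sample data are hypothetical.

```python
import math


def calibrated_rate(test_judge, calib_human, calib_judge, z=1.96):
    """Bias-correct a judge's pass rate with a human-verified calibration set.

    All labels are 0/1. Returns (corrected_rate, (ci_low, ci_high)).
    Assumption: Rogan-Gladen correction with a delta-method interval,
    not necessarily the procedure used in the paper.
    """
    if not test_judge:
        raise ValueError("test_judge must not be empty")
    if len(calib_human) != len(calib_judge):
        raise ValueError("calibration label lists must have equal length")

    # Raw (biased) pass rate reported by the judge on the test set.
    n = len(test_judge)
    p_hat = sum(test_judge) / n

    # Split the judge's calibration labels by the human ground truth.
    pos = [j for h, j in zip(calib_human, calib_judge) if h == 1]
    neg = [j for h, j in zip(calib_human, calib_judge) if h == 0]
    if not pos or not neg:
        raise ValueError("calibration set needs both human labels")
    m1, m0 = len(pos), len(neg)
    q1 = sum(pos) / m1          # sensitivity: judge says 1 when human says 1
    q0 = (m0 - sum(neg)) / m0   # specificity: judge says 0 when human says 0

    d = q0 + q1 - 1
    if d <= 0:
        raise ValueError("judge is no better than chance on the calibration set")

    # Rogan-Gladen point estimate, clipped to the valid range.
    theta = min(1.0, max(0.0, (p_hat + q0 - 1) / d))

    # Delta-method variance: sampling noise from the test set plus the
    # uncertainty in q0 and q1 estimated from the calibration set.
    var = (
        p_hat * (1 - p_hat) / n
        + (1 - theta) ** 2 * q0 * (1 - q0) / m0
        + theta ** 2 * q1 * (1 - q1) / m1
    ) / d ** 2
    half = z * math.sqrt(var)
    return theta, (max(0.0, theta - half), min(1.0, theta + half))


# Made-up illustration data: 20 calibration items, 100 judged test items.
calib_human = [1] * 10 + [0] * 10
calib_judge = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
test_judge = [1] * 60 + [0] * 40

theta, (lo, hi) = calibrated_rate(test_judge, calib_human, calib_judge)
print(f"raw={sum(test_judge) / len(test_judge):.3f} corrected={theta:.3f} "
      f"CI=({lo:.3f}, {hi:.3f})")
```

In this sketch the interval is wider than a naive binomial interval on the raw score because it also carries the uncertainty from the small calibration set. With very small calibration sets the normal approximation is crude, so a more careful interval construction would be needed in practice.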
