Judging LLM-as-a-judge with MT-Bench and Chatbot Arena paper page: https://huggingface.co/papers/2306.05685
… Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To
Evaluating LLM-as-a-Judge: MT-Bench and Chatbot Arena
