2/2 So, the gold standard remains human preference evaluation, which is expensive and difficult to automate and scale. But even human preference evaluation has its flaws. For example, see The False Promise of Imitating Proprietary LLMs (https://arxiv.org/abs/2305.15717).
Human Preference Evaluation as LLM Gold Standard Despite Its Limitations