Agreed, I don't like this behavior either. IMO, this is a limitation of our answer-based evals, where answers with more information tend to be preferred over shorter ones. Conversation-based comparisons might avoid this, because you'd judge the entire experience rather than a single answer.
Answer-Based Evals Limitations and Conversation-Based Comparisons