Determine what you care about (ex: conciseness, factuality, style, etc.). The either:
– hand rate a bunch of examples on each dimension, train a model to understand what to look for
– build a LLM prompt with Mistral or similar to do this without fine-tuning Run over all the
Rating LLM outputs: hand-labeling versus prompt-based evaluation
By
–