Yes, it’s cheaper and easier, but imho it’s more of an internal sanity check than an outward-facing eval to report. Btw, spot on regarding including it for the sake of benchmarks. You can tell from how sensitive some LLMs are to the exact multiple-choice (MC) prompt format.
Internal Sanity Checks vs External Evaluations for LLM Benchmarking