LLM benchmark question: benchmarks like MMLU mostly test knowledge. What are the most interesting benchmarks if I care less about what the model "knows" and more about how good it is at tasks like summarization, data extraction, and RAG Q&A against input?