Key Benchmarks for Evaluating LLM Reasoning and Coding Abilities

Use HellaSwag and ARC for reasoning, MMLU for broad knowledge, TruthfulQA for truthfulness, and HumanEval for code generation. Together these benchmarks exercise a model's core abilities and surface weaknesses you might otherwise miss.
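To make the coding benchmark concrete: HumanEval results are usually reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator (given n samples per problem, of which c passed) might look like this; the function name is illustrative:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total completions sampled, c: completions that passed,
    k: samples allowed. Computes 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples, 20 correct -> pass@1 is 20/200 = 0.1
print(pass_at_k(200, 20, 1))
```

Averaging this per-problem estimate over the full benchmark gives the headline pass@k score.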