In the spirit of being very meta, here's my personal meta-review of all the leaderboarding methodologies:
1. I like the Elo ranking based on Chatbot Arena from @lmsysorg.
2. LM harness (e.g., zero-shot PIQA, HellaSwag, etc.) is the equivalent of "MNIST" for LLMs. Okay-ish.
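For readers unfamiliar with Elo, here is a minimal sketch of the textbook pairwise update that arena-style rankings build on. The function names, the K-factor of 32, and the 400-point scale are illustrative assumptions taken from standard Elo conventions, not a description of how Chatbot Arena actually computes its leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is an assumed K-factor; real leaderboards may use a different value.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


# Hypothetical example: two models start level and model A wins one vote.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

With two equally rated models the expected score is 0.5 each, so under these assumptions the winner gains 16 points and the loser drops 16. An upset win against a higher-rated model moves the ratings by more than that.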
Meta-Review of LLM Leaderboard Evaluation Methodologies