I want to work with someone on creating a benchmark for new LLMs. My problem with LMArena-type leaderboards is that they're heavily biased towards aesthetics and clean formatting. Most other benchmarks are biased towards complex reasoning, science, math, and coding… The
Creating a Balanced LLM Benchmark Beyond Aesthetics