That's led to a very weird debate over how to explicitly benchmark these models against each other. There's no real consensus on how to compare one against another, and many (like Falcon 40B) are using leaderboards as their selling point.
LLM Benchmark Debate: Comparing Model Performance Standards