I think those scores reflect less a model's general accuracy than how many of the benchmark's increasingly weird challenges it gets right. Depending on the application, I would expect a model that scores 70% on the benchmarks to get closer to 100% in actual use, since everyday tasks rarely look like the hardest benchmark items.
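To make the arithmetic behind that intuition concrete, here is a rough sketch. Every number in it is made up for illustration (the difficulty buckets, the per-bucket accuracies, and both task mixes are assumptions, not measurements from any real benchmark or model). It only shows how the same model can score very differently when the mix of easy and hard tasks changes.

```python
# Hypothetical per-difficulty accuracy for one model (illustrative numbers only).
ACCURACY_BY_DIFFICULTY = {"easy": 0.99, "medium": 0.80, "hard": 0.31}

# Assumed task mixes: the benchmark over-samples hard cases, real use does not.
BENCHMARK_MIX = {"easy": 1 / 3, "medium": 1 / 3, "hard": 1 / 3}
REAL_WORLD_MIX = {"easy": 0.90, "medium": 0.09, "hard": 0.01}


def expected_accuracy(accuracy, mix):
    """Return the mix-weighted average accuracy across difficulty buckets."""
    total_weight = sum(mix.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("mix weights must sum to 1")
    return sum(accuracy[bucket] * weight for bucket, weight in mix.items())


benchmark_score = expected_accuracy(ACCURACY_BY_DIFFICULTY, BENCHMARK_MIX)
real_world_score = expected_accuracy(ACCURACY_BY_DIFFICULTY, REAL_WORLD_MIX)

print(f"benchmark:  {benchmark_score:.1%}")
print(f"real world: {real_world_score:.1%}")
```

With these assumed numbers the model lands at about 70% on the benchmark mix and about 97% on the real-world mix. How large the gap is in practice depends entirely on how often an application actually hits the hard cases.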
Benchmark Accuracy vs Real-World Model Performance Analysis