This is a really nicely done paper, and a good example of how experts can not only design good benchmarks but also better diagnose what the AI is doing wrong. In this case, the AI fails to turn to lookup tables and fails to do its math with coding tools (both of these seem solvable in time).
Expert Benchmarking: Diagnosing AI Mathematical and Lookup Weaknesses
