The “Clock” benchmark measures how well models can tell the time. I don't know whether I'm more surprised that fewer than 90% of people can read a clock themselves, or that the best models currently don't exceed 14% accuracy. Anyway, cool benchmark!
Clock Benchmark: Models Struggle with Time Recognition Task
