This quote is especially important:
> "Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can
