Even frontier models fail all the time. The difference is that we fail in 15 seconds, and with one more prompt you get the right answer. Time to success: 30 seconds. On GPT Codex, it takes 22 minutes just to find out it failed.
Frontier Models vs GPT Codex: Speed to Success Comparison
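The arithmetic behind the 30-second figure can be sketched in a few lines. This is a rough illustration, not a benchmark: the helper `time_to_success` and the constant names are hypothetical, and the model assumes every attempt, failed or successful, takes the same wall-clock time. The two-attempt count for the slow case is also an assumption, since the text only gives the 22 minutes needed to discover a failure.

```python
def time_to_success(seconds_per_attempt: float, attempts: int) -> float:
    """Total wall-clock time, assuming every attempt takes the same time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return seconds_per_attempt * attempts


FAST_ATTEMPT_S = 15       # from the text: a failure surfaces in 15 seconds
SLOW_ATTEMPT_S = 22 * 60  # from the text: 22 minutes to learn of a failure

# One failed attempt plus one successful retry, as described in the text.
fast_total = time_to_success(FAST_ATTEMPT_S, attempts=2)

# Assumption (not in the text): the slow tool also succeeds on its second try.
slow_total = time_to_success(SLOW_ATTEMPT_S, attempts=2)

print(f"fast loop: {fast_total} s")
print(f"slow loop: {slow_total} s ({slow_total / 60:.0f} min)")
print(f"ratio: {slow_total / fast_total:.0f}x")
```

Under these assumptions the gap comes entirely from the length of the feedback loop, not from how often either tool fails.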