
On M-Cube: All models failed.
Zero percent on full tasks.
Even with 10,000+ tokens of “reasoning.” On the simplified version? GPT-o3 barely crossed 72% – and only after the search space had been reduced 5 million-fold.