François Chollet just dropped ARC-AGI-3, the toughest benchmark yet, and it made every frontier AI look lost. 135 game environments built from scratch by game designers. No instructions, no rules, no stated goal. The AI gets placed inside and has to work out what it is even trying to do.

Untrained humans cleared all 135. Every major model landed below 1%. Humans: 100%. Gemini 3.1 Pro: 0.37%. GPT 5.4: 0.26%. Opus 4.6: 0.25%. Grok-4.20: 0.00%.

The scoring is built to punish shortcuts: if a human solves a game in 10 moves and the AI needs 100, the AI scores 1%. Throwing more compute at it makes no difference.

For context: ARC-AGI-1 is essentially a solved problem at this point; Gemini scores 98% on it. ARC-AGI-2 went from 3% to 77% in under a year, with labs pouring millions into it. ARC-AGI-3 made all of that progress feel small.

It was announced live at Y Combinator, in a fireside between Chollet and Sam Altman, with $2M in prizes on Kaggle. Every winning solution has to be open sourced.

Scaling will not fix this. We are not close to AGI. (Find link in the comments)
→ View original post on X — @aihighlight, 2026-03-31 16:44 UTC
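For readers wondering how 10 human moves versus 100 AI moves yields 1% rather than 10%: one scoring rule consistent with the quoted numbers is a squared efficiency ratio. The sketch below is an illustration under that assumption only; the post does not state the actual ARC-AGI-3 formula, and `efficiency_score`, its cap, and the exponent are all hypothetical.

```python
# Hypothetical scoring rule consistent with the post's example.
# Assumption: score = (human_baseline_moves / agent_moves)^2, capped at 100%.
# This reproduces 10 vs. 100 moves -> 1%, but is NOT the published formula.

def efficiency_score(human_moves: int, agent_moves: int) -> float:
    """Return a percentage score that punishes inefficient solutions."""
    if agent_moves <= 0:
        raise ValueError("agent_moves must be positive")
    ratio = min(human_moves / agent_moves, 1.0)  # cap: matching the baseline scores 100%
    return ratio ** 2 * 100

print(efficiency_score(10, 100))  # 1.0   (the example quoted in the post)
print(efficiency_score(10, 10))   # 100.0 (matching the human baseline)
```

A linear ratio would give 10% for the same example, so the quadratic penalty is what makes the quoted 1% come out; treat that exponent as a guess.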
