François Chollet just dropped ARC-AGI-3, the toughest benchmark yet, and it made every frontier AI look lost. 135 game environments built from scratch by game designers. No instructions, no rules, no stated goal. The AI gets placed inside and has to work out what it is even trying to do.

Untrained humans cleared all 135. Every major model landed below 1%. Humans: 100%. Gemini 3.1 Pro: 0.37%. GPT 5.4: 0.26%. Opus 4.6: 0.25%. Grok-4.20: 0.00%.

The scoring is built to punish shortcuts: if a human solves a game in 10 moves and the AI needs 100, the AI scores 1%. Throwing more compute at it makes no difference.

For context: ARC-AGI-1 is essentially a solved problem at this point; Gemini scores 98% on it. ARC-AGI-2 went from 3% to 77% in under a year, with labs pouring millions into it. ARC-AGI-3 made all of that progress feel small.

It was announced live at Y Combinator, in a fireside between Chollet and Sam Altman, with $2M in prizes on Kaggle. Every winning solution has to be open sourced.

Scaling will not fix this. We are not close to AGI. (Find link in the comments)
→ View original post on X — @aihighlight, 2026-03-31 16:44 UTC
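For readers wondering how 10 human moves versus 100 AI moves yields 1% rather than 10%: one scoring rule consistent with the quoted numbers is a squared efficiency ratio. The sketch below is an illustration under that assumption only; the post does not state the actual ARC-AGI-3 formula, and `efficiency_score`, its cap, and the exponent are all hypothetical.

```python
# Hypothetical scoring rule consistent with the post's example.
# Assumption: score = (human_baseline_moves / agent_moves)^2, capped at 100%.
# This reproduces 10 vs. 100 moves -> 1%, but is NOT the published formula.

def efficiency_score(human_moves: int, agent_moves: int) -> float:
    """Return a percentage score that punishes inefficient solutions."""
    if agent_moves <= 0:
        raise ValueError("agent_moves must be positive")
    ratio = min(human_moves / agent_moves, 1.0)  # cap: matching the baseline scores 100%
    return ratio ** 2 * 100

print(efficiency_score(10, 100))  # 1.0   (the example quoted in the post)
print(efficiency_score(10, 10))   # 100.0 (matching the human baseline)
```

A linear ratio would give 10% for the same example, so the quadratic penalty is what makes the quoted 1% come out; treat that exponent as a guess.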
