It is why the gold medals at the various math and coding Olympiads were a big deal: unsaturated benchmarks that weren't in the training data, with clear human comparisons. We are down to the various measures of task length (METR), HLE, FrontierMath, vending machine operation…
@emollick
-
Intelligence Index Benchmarks Need Improvement Beyond Saturation
Not to take away from Grok 4 Fast (which seems like a very good model) or from Artificial Analysis (one of the few organizations doing independent benchmarking), but the Intelligence Index is an average of pretty saturated benchmarks (aside from HLE); we really need better ones.
-
AI matching web search for political information accuracy
A cautiously optimistic result on AI and disinformation. A week before the 2024 UK elections, 13% of all voters had used AI for political topics. A randomized trial found this may be good: using AI led to similar gains in true knowledge as doing web search, regardless of model & prompts.
-
Self-Correcting AI Agents Enable Exponential Task Horizon Gains
I think the significance of this is under-appreciated: the assumption has often been that AI agents are brittle, since one failure in a chain breaks the whole task. But this paper shows smart models are self-correcting, & that small gains in accuracy lead to exponential gains in task horizons.
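The compounding claim can be sketched with a toy model (my own illustration, not the paper's actual analysis): if each step of a multi-step task succeeds independently with probability p, the longest task an agent completes at least half the time is ln(0.5)/ln(p) steps, so linear-looking gains in per-step accuracy yield exponential gains in horizon.

```python
import math

def task_horizon(step_accuracy: float, target: float = 0.5) -> float:
    """Longest chain of steps completed with at least `target` probability,
    assuming each step succeeds independently with `step_accuracy`."""
    return math.log(target) / math.log(step_accuracy)

# Modest gains in per-step accuracy give outsized horizon gains:
for p in (0.90, 0.95, 0.99):
    print(f"per-step accuracy {p:.2f} -> ~{task_horizon(p):.0f}-step horizon")
# prints horizons of roughly 7, 14, and 69 steps
```

The independence assumption is the pessimistic "brittle chain" view; self-correction effectively raises the per-step success rate, which is exactly where the exponential leverage lives.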
-
Coding Tools Enable Problem-Solving Without Expert Programming Skills
Sure, there are other ways to work with these tools, but all of them require understanding something about coding practices. And sure, not knowing those hurts your ability to do "real programming" – but the coding tools are good enough to solve lots of problems with tiny, bad code.
-
Accessibility barriers to agentic AI coding tools for non-developers
One of the larger barriers to more people using agentic coding tools from the big AI companies to build their own small apps is that you have to go through GitHub to use them, a website that is nearly incomprehensible to most non-coders.
-
Coding Gatekeeping in AI Development Labs
Coder says what? Joking. Yes, I get there is a plausible reason why coding is elevated the way it is in the labs, but it still leaves almost all work & workers (& students) out of the really interesting part of rapid AI development that only programmers get to see right now.
-
Frontier LLMs and Specialized Models: Essential for AI Tool Development
Yes, every other company on the planet is rushing to release AI tools for other forms of work, but if you don't own a frontier LLM & you can't train specialized models to go with your specialized AI-for-X interface, you are limited in what you can accomplish. Again, see coding.
-
AI Labs Prioritize Code Tools Over Other Specialized Applications
The problem with the AI labs being run by coders who think code is the most vital thing in the world is that the labs keep developing supercool specialized tools for coding (Codex, Claude Code, Cursor, etc.) while every other form of work is stuck with generic chatbots.
-
Scaling Returns in AI: Reasoning Models Drive Exponential Project Completion
Paper argues that diminishing returns to AI scale are an illusion. Economic value comes from completing long projects, not answering single questions. And accuracy determines how long a project an AI can complete: small gains compound exponentially! Reasoning models are much more accurate, with big impacts.