AI Dynamics

Global AI News Aggregator

Terminal-Bench 3.0 and the Benchmark Factory Revolution

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works. I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (@harborframework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."

00:23 – How quickly models hill-climbed TB2
01:46 – What rapid progress reveals about benchmarks vs. real-world capability
03:28 – What made Terminal-Bench stick
04:58 – Why the terminal is the right abstraction for agentic AI
07:14 – How TB2 maintains task quality at scale
09:23 – Managing benchmark integrity in a benchmaxxing world
10:47 – Harbor: from experiment to benchmark factory
12:19 – What Harbor does that nothing else did
14:37 – The invariants: what won't change as agent evals evolve
16:55 – The benchmark Alex most wants to see built
18:18 – The ideal human-in-the-loop task creation flywheel
20:32 – How to contribute to Terminal-Bench 3.0

→ View original post on X — @snorkelai, 2026-03-31 18:50 UTC
