vincent sunn chen (@vincentsunnchen), March 31, 2026
Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works.

I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (@harborframework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."

00:23 – How quickly models hill-climbed TB2
01:46 – What rapid progress reveals about benchmarks vs. real-world capability
03:28 – What made Terminal-Bench stick
04:58 – Why the terminal is the right abstraction for agentic AI
07:14 – How TB2 maintains task quality at scale
09:23 – Managing benchmark integrity in a benchmaxxing world
10:47 – Harbor: from experiment to benchmark factory
12:19 – What Harbor does that nothing else did
14:37 – The invariants: what won't change as agent evals evolve
16:55 – The benchmark Alex most wants to see built
18:18 – The ideal human-in-the-loop task creation flywheel
20:32 – How to contribute to Terminal-Bench 3.0
→ View original post on X — @snorkelai, 2026-03-31 18:50 UTC