Rubrics have become widely accepted for evaluating agents and models, but how are we evaluating the rubrics themselves? In a new paper we’ll be presenting at the Data-FM workshop at @iclr_conf, we introduce RIFT: a taxonomy of 8 rubric failure modes across:
➜ reliability
➜ …
— @snorkelai
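The post doesn’t excerpt the paper, but as a rough illustration of what a “reliability” failure mode can mean in practice, here is a minimal sketch (hypothetical names, not RIFT’s actual method): re-score the same response under the same rubric with an LLM judge and flag the rubric when repeated scores disagree.

```python
# Hypothetical sketch of a rubric "reliability" probe: score the same
# response under the same rubric several times and flag high variance.
# `judge_score` stands in for any LLM-judge call; not RIFT's actual API.
import statistics
from typing import Callable

def reliability_check(
    judge_score: Callable[[str, str], float],  # (rubric, response) -> score
    rubric: str,
    response: str,
    n_trials: int = 5,
    max_stdev: float = 0.5,
) -> bool:
    """Return True if repeated judgments under this rubric are stable."""
    scores = [judge_score(rubric, response) for _ in range(n_trials)]
    return statistics.stdev(scores) <= max_stdev
```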
-
RIFT: Taxonomy of Rubric Failure Modes for AI Evaluation
-
Snorkel AI attending ICLR 2026 conference
We’ll be there. Come find us: http://snorkel.ai/iclr-2026/ – more soon.
-

Snorkel AI Happy Hour in London – Two Days Away
Only 2 days until we’re in London for @aiDotEngineer 🇬🇧 Come connect with the Snorkel AI team and fellow leaders shaping the future of AI agents – over drinks and conversation 🍻 Last chance to register for Tuesday’s Happy Hour at Bantof: luma.com/SnorkelVIPHappyHour…
→ View original post on X — @snorkelai, 2026-04-05 19:48 UTC
-
Building the Benchmark Factory: Harbor Framework’s Infrastructure Approach
ICYMI – How can we build the benchmark factory?
I'm very excited about the infra approach from @harborframework, because @alexgshaw @ryanmart3n & team obsess over researcher/developer UX (e.g. quality guardrails, low friction to RL/scaled rollouts)! pic.twitter.com/0MLD5Ggkzm
— vincent sunn chen (@vincentsunnchen), 3 April 2026
→ View original post on X — @snorkelai, 2026-04-03 18:14 UTC
-

Snorkel AI attending AIE Europe conference in London next week

We'll be in London next week for AIE. Come say hi (DMs open)!! 🇬🇧
Quoting swyx (@swyx): so AIE Europe is completely taking over 🇬🇧 London next week! very very hyped to showcase the best companies, research, and AI engineers in Europe! 3 COMPLETELY FREE ways to join in:
– there are a dozen side events around town! from Snorkel to GitHub to Arize to ClawCon and Claude Code meetups!
– subscribe on YouTube! everything will be livestreamed and published for free piped.video/@aidotengineer
– we are releasing 20 more volunteer slots here ai.engineer/associates meant for local, early career folks who otherwise could not afford a ticket!
join in/see you in london town! — https://nitter.net/swyx/status/2039398936423850017#m
→ View original post on X — @snorkelai, 2026-04-01 19:25 UTC
-

Snorkel AI Happy Hour in London on April 7

See you in London 🇬🇧 Snorkel AI is hosting a happy hour at Bantof on April 7 for folks working on AI agents, evals, datasets, and open source. Great chance to meet others building in the space (plus food & drinks 🍻) Request an invite: luma.com/SnorkelVIPHappyHour…
Quoting swyx (@swyx): https://nitter.net/swyx/status/2039398936423850017#m
→ View original post on X — @snorkelai, 2026-04-01 18:44 UTC
-
Need 1000x More Benchmarks for Coding AI Evaluation
“We need a thousand times more benchmarks than we have right now” is @alexgshaw of @LaudeInstitute's take on the current moment. “Coding is an extremely broad domain, 89 tasks isn’t nearly enough.”
Full Benchtalks interview posted by @vincentsunnchen; YouTube link in the replies. pic.twitter.com/BRxjq6uzR4
— Snorkel AI (@SnorkelAI), 1 April 2026
→ View original post on X — @snorkelai, 2026-04-01 16:35 UTC
-
Terminal-Bench 2 Scores Jump to 75-80% in 4 Months
Top scores on Terminal-Bench 2 went from ~25% → 75-80% in just 4 months.
For Benchtalks #1, @vincentsunnchen sat down with @alexgshaw to dig into what happens when your benchmark gets solved before you're ready for the next one.
Key takes:
→ The terminal is the right abstraction for agentic AI
→ Harbor exists because benchmarking and RL at scale are infra problems
→ "Benchmaxxing" is real; the defense is shipping harder tasks faster
→ TB3 is coming, and they want your hardest unsolvable problems
"We need 1000x more benchmarks than we have right now" — @alexgshaw pic.twitter.com/XkWlJT8SKT
— Snorkel AI (@SnorkelAI), 31 March 2026
→ View original post on X — @snorkelai, 2026-03-31 19:21 UTC
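The post doesn’t show Terminal-Bench internals; as a hedged sketch of why the terminal works as an eval abstraction, here is a generic harness (invented structure, not Terminal-Bench’s actual task format): run the agent’s shell command in a scratch directory, then grade by the exit code of a verifier command.

```python
# Generic sketch of a terminal-style eval task: run the agent's command,
# then a verifier command, and grade on the verifier's exit code.
# Assumes a POSIX shell; not Terminal-Bench's actual task format.
import subprocess
import tempfile

def run_terminal_task(agent_cmd: str, verify_cmd: str, timeout: int = 60) -> bool:
    """Execute the agent command in a scratch dir, then verify the result."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(agent_cmd, shell=True, cwd=workdir, timeout=timeout)
        check = subprocess.run(verify_cmd, shell=True, cwd=workdir, timeout=timeout)
        return check.returncode == 0  # pass iff the verifier succeeds

# Toy example: the "agent" must create a file; the verifier checks it exists.
print(run_terminal_task("touch done.txt", "test -f done.txt"))
```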
-
BenchTalks Episode: Open Benchmark Grants Available Now
🎧 Full episode: snorkel.ai/blog/benchtalks-a…
💰 Building an open benchmark? Apply for Snorkel's Open Benchmark Grants ↓ snorkel.ai/open-benchmark-gr…
→ View original post on X — @snorkelai, 2026-03-31 19:21 UTC
-
Terminal-Bench 3.0 and the Benchmark Factory Revolution
Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works.
I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them.
His answer: the benchmark factory (@harborframework) – infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now." pic.twitter.com/phNP7ni43t
00:23 – How quickly models hill-climbed TB2
01:46 – What rapid progress reveals about benchmarks vs. real-world capability
03:28 – What made Terminal-Bench stick
04:58 – Why the terminal is the right abstraction for agentic AI
07:14 – How TB2 maintains task quality at scale
09:23 – Managing benchmark integrity in a benchmaxxing world
10:47 – Harbor: from experiment to benchmark factory
12:19 – What Harbor does that nothing else did
14:37 – The invariants: what won't change as agent evals evolve
16:55 – The benchmark Alex most wants to see built
18:18 – The ideal human-in-the-loop task creation flywheel
20:32 – How to contribute to Terminal-Bench 3.0
— vincent sunn chen (@vincentsunnchen), 31 March 2026
→ View original post on X — @snorkelai, 2026-03-31 18:50 UTC
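The episode describes Harbor’s “quality guardrails” only at a high level; as one hedged sketch of the idea behind a benchmark factory, an intake step might validate every submitted task before it enters the pool (the schema below is invented for illustration, not Harbor’s actual format).

```python
# Hypothetical task-intake guardrail for a "benchmark factory": reject
# task specs missing a description, a verifier, or a sane timeout.
# Invented schema for illustration; not Harbor's actual task format.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    name: str
    description: str
    verify_cmd: str    # command whose exit code grades the task
    timeout_sec: int

def validate_task(task: TaskSpec) -> list[str]:
    """Return guardrail violations; an empty list means the task is accepted."""
    problems = []
    if not task.description.strip():
        problems.append("missing description")
    if not task.verify_cmd.strip():
        problems.append("missing verifier command")
    if not (0 < task.timeout_sec <= 3600):
        problems.append("timeout must be in (0, 3600] seconds")
    return problems
```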