Sonnet 4.5 Achieves 82% on SWE-bench with Test-Time Compute

Sonnet 4.5 hits 82% on SWE-bench with test-time compute. We're adding another dot to what's been a pretty clean exponential toward saturating SWE-bench as a benchmark. At this rate we'll need new evals soon, but it's still a useful signal that the capability curve is holding.