
The benchmark used to rank AI coding agents was wrong 32% of the time. DeepSWE is a new open benchmark that fixes this. Its tasks span 91 real codebases, average 668 changed lines each, and are written from scratch so no model has seen the answer. DeepSWE's error rate: 1.4%.
