AI Dynamics

Global AI News Aggregator

Sonnet 4.5 tops SWE-bench leaderboard

Sonnet 4.5 achieves state-of-the-art results on SWE-bench Verified and currently leads its leaderboard. The benchmark tests real-world software engineering: its tasks are actual GitHub issues whose fixes can require multi-file edits, testing, and dependency management. Sonnet 4.5 can also maintain focus for more than 30 hours on complex tasks.

→ View original post on X — @godofprompt