Terminal-Bench is challenging even for the most advanced agents:
• OpenAI's Codex (gpt-5-codex): 42.8% verified score
• Anthropic’s Claude Code (claude-sonnet-4-5): 50.0% per their release announcement
• Leaderboard:
Advanced AI Agents Benchmark Performance Comparison