i just kicked off my Senior Engineer bench on Codex's /goal feature. we'll see how well it compares to a senior engineer rewriting a slop codebase. current high score on this benchmark is 66/100, achieved by GPT-5.5 with an Opus 4.6 plan, but with an agent babysitter to make
Benchmarking Codex Goal Feature Against Senior Engineer Tasks
