Engineering at Anthropic dropped another banger. Their internal playbook for evaluating AI agents. Here's the most counterintuitive lesson I learned from it: Don't test the steps your agent took. Test what it actually produced. This goes against every instinct. You'd think
