BullshitBench: Opus 4.7 did WORSE than the Opus 4.6 family. The 'Max' thinking version did worse than non-thinking: 74% 'pushback' vs 83% for non-thinking. As always, code, data, etc. are on GitHub.
