"Generally, the code-specialized RL'd models end up cheating and lying more; I call it RL-fry […] Reward hacking as the default mindset." Anthropic models are less fried, as should be obvious to anyone who reviews the slop the others generate. nitter.net/alexjc/status/20385610…

Alex J. Champandard 🌱 (@alexjc)

My End-Of-Month "Use Remaining Coding Credits" Report: planning and building a Cython virtual machine from scratch for a complete, well-specified functional stack language:

* Opus 4.6 is so aligned that it makes decisions closer to what you (an expert) would make; the result is code that is qualitatively better and quantitatively faster, and it sparks joy through interactions with a nice mindset. Novel ideas emerge from that! I had stopped using Opus in favor of cheaper tokens, and the extra distance helped me appreciate it more, but I'm now questioning how I allocated my time and tokens…

* GPT 5.4 is basically autistic: unable to understand broad context, infer intent, or make good choices under ambiguity; it only solves clearly defined problems, and it takes a lot of patience to deal with all those symptoms and more. It pushes the mental burden onto you to overspecify and then manage its behavior. In the end, it planned and built a worse solution that was slower than Opus's and harder to extend. (I'm using 'autism' as a cognitive and behavioral diagnostic here; separately, and on top of that, I feel GPT 5.4 also inherited a frustrating personality and occasionally bad attitude from its training.) After the prototypes, I used GPT 5.x to clean up Opus 4.6's work to great success; it's solid for local, well-defined tasks with measurable outcomes.

* Composer 2 broke in Cursor IDE three times due to a reproducible worktree bug, but once I got around that it one-shotted a somewhat functional solution only 3x slower than Claude's! Then, when asked for minor improvements, it tripped over its feet and struggled to reason about tricky bugs and their implications. From there it was sassy and gaslighting about the problems. It eventually found a solution 40% faster than Claude's on one benchmark, but built on shortcuts and hacks. (It could be a useful sub-frontier model because it sits in a different token pool and price point, but it's not yet clear how it distinguishes itself from GPT 5.x in the small-tasks category.)

* GLM 5.1 couldn't figure out Cursor's new terminal output/reading mechanisms at all. The tool calls show up OK in the frontend, but now disappear when clicked (another UI bug). Apparently the result is somehow not shown to the LLM. It could be a bug in the way Zai implements its OpenAI endpoint, because it's specific to that model… (This works for GLM 4.7 and 5.0, but I will try again separately in `pi`.)

* Generally, the code-specialized RL'd models end up cheating and lying more; I call it RL-fry, like Silicon Valley CEOs' vocal fry but for model cognition. Reward hacking is the default mindset, which is why I think non-code-specific models are nicer to work with… (Only Anthropic gets this; the others incorrectly play 'catch-up' exclusively through RL score maxxing.)

* I used Codex 5.3 during most of the month for well-defined work, but I'm not entirely convinced. GPT 5.2 (non-Codex) has been great value for money for me in fixing bugs, minor local features, etc. However, the more expensive the 5.x series gets, the better Claude looks: it must be 3x-4x cheaper for me to justify putting up with the OpenAI model mindset.

* It becomes more important than ever to have reliable dispatching for Pareto-optimal use of tokens depending on the task at hand. The GPT models should likely not be considered interactive by default; you need to prompt them very strictly before they become usable. Ideally they should not respond with words to users, only provide verifiable facts (due to attitude and misalignment)!

— https://nitter.net/alexjc/status/2038561083003133955#m
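For readers unfamiliar with the kind of project the report describes, here is a minimal sketch of what a virtual machine for a stack language looks like: a dispatch loop over a data stack. This is plain Python, not Cython, and the opcode names are hypothetical illustrations, not the report's actual design.

```python
# Minimal stack-machine sketch: each instruction pops its operands
# from the data stack and pushes its result. Opcodes are hypothetical.

def run(program):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack

# (2 + 3) * 4 in postfix form:
result = run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)])
# result == [20]
```

A production version would replace the `if/elif` chain with a jump table and typed C-level structures, which is exactly where Cython pays off.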
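The final point, routing each task to the cheapest model that can handle it, can be sketched as a simple rule-based dispatcher. The model names, capability tiers, and costs below are illustrative assumptions, not real pricing or the author's actual setup.

```python
# Sketch of task-based model dispatching for Pareto-optimal token
# spend: pick the cheapest model judged capable of the task category.
# Names, tiers, and costs are illustrative assumptions.

MODELS = [  # ordered cheapest-first: (name, cost per 1M tokens, capability tier)
    ("small-fast", 1, 1),
    ("code-cleanup", 5, 2),
    ("frontier", 25, 3),
]

TASK_TIER = {  # minimum capability tier each task category needs
    "fix-typo": 1,
    "local-bugfix": 2,
    "plan-architecture": 3,
}

def dispatch(task_category):
    """Return the cheapest model whose capability tier covers the task."""
    needed = TASK_TIER[task_category]
    for name, cost, tier in MODELS:
        if tier >= needed:
            return name
    raise ValueError("no capable model configured")
```

For example, `dispatch("fix-typo")` resolves to the cheap model, while `dispatch("plan-architecture")` escalates to the frontier tier; the hard part in practice is classifying tasks reliably, not the routing itself.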
Comparing coding AI models: Claude Opus outperforms GPT and others