Maybe we should come up with some more world model tests that could show it more definitively. One thing that is hard is telling whether we are just testing the language component of the model, which confuses things. I haven't obviously noticed that GPT was worse at world model stuff, but