I never complain! The peak was when we had 4o and o4 at the same time
@petergostev
-

Codex-Max Controversy: What Happened and Why It Matters
Are we going to pretend like the whole 'Codex-Max' thing never happened?
-
Improving Image Generation Parallelization and Resolution in Codex
I'd appreciate if you could improve the way images are done in Codex – there seems to be 1) no parallelisation, 2) don't think I can go to 4k, 3) there is something funny sometimes with how Codex sees the generated images, whether they are getting properly passed into the
-
GPT-5.5 Coding Capabilities Better Than Rankings Suggest
Yeah totally, I agree gpt-5.5 is way better at coding than this ranking suggests, we'll do better & we'll have a better way to measure more broad coding ability too. Appreciate the feedback
-
Bullshit Benchmark: Evaluating AI Model Reliability
GitHub: https://github.com/petergpt/bullshit-benchmark …
DataViewer: https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html …
-

DeepSeek v4 Underperforms on BullshitBench Reasoning Tasks
BullshitBench: sorry to say but DeepSeek v4 did really badly, towards the bottom of the table, whether it is high or low reasoning.
-
GPT-5.5 Reinforcement Learning Scaling Across Model Sizes
GPT-5.5 by Reasoning Effort: I've asked it in Codex to create a physics-based visualisation of RL cycles for different sized models (70b, 1t, 10t), to demonstrate how the amount of RL you can do differs by model size.
My assessment of each:
– Low: weird slop
– Medium: kinda… pic.twitter.com/6YCNqPyzcR
Peter Gostev (@petergostev), 26 April 2026
-
GPT-5.5 Creates 360° Immersive Babylon Gardens World
Creating an immersive Hanging Gardens of Babylon world with 360° GPT-Image-2 & Codex in 1500 images.
I've tasked GPT-5.5 in Codex to construct a whole world that you can walk through 'Google Street View' style. It took 1,500 of 2:1 images that can be turned into a 360° immersive… pic.twitter.com/ML9d6EgXTz
Peter Gostev (@petergostev), 25 April 2026
-

GPT-5.5 and Pro Models Underperform on BullshitBench
BullshitBench: GPT-5.5 and 5.5-Pro update! They did NOT do well – 5.5 about the same level as GPT-5.4 (around 30-35 rank, 45% pushback). GPT-5.5-Pro did WORSE – only about 35% pushback. I must say the Pro result kind of shocked me. This is actually interesting, what this tells
-
ChatGPT criticized for excessive image generation feature
I would definitely say it is a little trigger happy in ChatGPT to make everything an image