Same, I'm confused by all the models they've been testing. They feel quite different from one another. Horizon better not be GPT-5; the anonymous chatbot they were testing 3 weeks ago was the real deal.
@petergostev
-

OpenAI’s Open-Source Models Beat Commercial Pricing Benchmarks
As benchmarks for OpenAI's new open-source models roll in, one thing to keep in mind is how cheap they actually are:
– GPT OSS 20b is cheaper than Gemini 2.5 Flash-Lite or GPT-4.1-Nano.
– GPT OSS 120b is cheaper than recent open-source models from China (e.g., Kimi K2 or GLM
-
Opus 4.1 vs Opus 4.0 Performance Comparison Test Results
Quick test of Opus 4.1 vs Opus 4.0 and other models, using a prompt from @FeatureCrewPod as well as a new prompt for a Colosseum simulation. My sense is that Opus 4.0 was kind of busted for these tests – too many errors, and the quality was just ok. 4.1 is now at around Sonnet level, so… pic.twitter.com/fYb5kD9ExE
– Peter Gostev (@petergostev) August 5, 2025
-
Improving reasoning performance on AI model endpoints
I couldn't set the reasoning level any higher on the endpoints I was testing, so that's another reason to hope for better outputs.
-
OpenAI Open Source Models 120B and 20B MoE Tested
A quick test of the new @OpenAI open source models, the 120b and 20b MoE models, with the @FeatureCrewPod planet generation prompt. The comparison is to the best one I've seen (a yet unreleased OpenAI model), with Sonnet 4, Kimi K2, Qwen 3 Coder and GLM-4.5 alongside. The… pic.twitter.com/a4bh1WclKV
– Peter Gostev (@petergostev) August 5, 2025
-
Recraft Crisp Upscale: Affordable AI Image Upscaling Tool
This sizing is a bit weird, so we use this upscaler: https://replicate.com/recraft-ai/recraft-crisp-upscale … – pretty good and cheap
-
Beyond Labels: Rethinking AI Intelligence and Human Definitions
The most boring and pointless discussions in AI, I find, are about definitions: can models "think"? Can they "reason"? What is AGI? What about ASI? The real answer is that AI is different from humans, and intelligence evolves in a different way. Any labels we put on it are
-
Estimating LLM Sizes: GPT-3 vs GPT-4 Parameters
Not completely made up actually – best guess more like. If you look at the size of GPT-3 and GPT-4, they reflect the relative size for each model (175b vs rumoured 1.8t for GPT-4). Then for the others I reduced the size for the models based on their price – so yes we don't know,
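The estimation approach described above can be sketched roughly as follows. This is only an illustration: the post does not say how price was mapped to size, so the linear price-to-size scaling here is an assumption, the function name `estimate_params_b` is hypothetical, and the prices are placeholders in arbitrary units rather than real price-list values. The only figure taken from the post is the rumoured 1.8t for GPT-4.

```python
def estimate_params_b(price: float, ref_price: float, ref_params_b: float) -> float:
    """Scale a reference parameter count (in billions) by a price ratio.

    Linear scaling is an assumption made for illustration only.
    """
    if ref_price <= 0:
        raise ValueError("ref_price must be positive")
    return ref_params_b * (price / ref_price)


# Reference point from the post: GPT-4 at a rumoured 1.8t (1800b) parameters.
GPT4_PARAMS_B = 1800.0

# Placeholder prices in arbitrary units, NOT real price-list values.
ref_price = 10.0
other_price = 2.5

estimate = estimate_params_b(other_price, ref_price, GPT4_PARAMS_B)
print(f"Estimated size: {estimate:.0f}b parameters")
```

With these placeholder inputs, a model priced at a quarter of the reference comes out at a quarter of the reference size. Any real estimate depends entirely on the prices and the scaling rule you choose.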
-

OpenAI and Anthropic Revenue Growth Comparison 2025
OpenAI and Anthropic both are showing pretty spectacular growth in 2025, with OpenAI doubling ARR in the last 6 months from $6bn to $12bn and Anthropic increasing 5x from $1bn to $5bn in 7 months. If we compare the sources of revenue, the picture is quite interesting:
– OpenAI -
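To put the two ARR figures quoted above on the same footing, one option is to convert each into an implied monthly growth rate. The sketch below assumes smooth compound growth over each period, which is a simplification of my own and not something the post claims; the helper name `monthly_growth` is hypothetical.

```python
def monthly_growth(start: float, end: float, months: int) -> float:
    """Implied compound monthly growth rate between two ARR figures."""
    if start <= 0 or months <= 0:
        raise ValueError("start and months must be positive")
    return (end / start) ** (1 / months) - 1


# ARR figures in $bn, as quoted in the post.
openai = monthly_growth(6, 12, 6)      # $6bn -> $12bn in 6 months
anthropic = monthly_growth(1, 5, 7)    # $1bn -> $5bn in 7 months

print(f"OpenAI: {openai:.1%} per month")
print(f"Anthropic: {anthropic:.1%} per month")
```

Under that assumption, OpenAI works out to roughly 12% a month and Anthropic to roughly 26% a month, though Anthropic starts from a much smaller base.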
GPT-5 Impact: Free ChatGPT Access for 700M Users
The most impactful thing about GPT-5 is not going to be the top tier of intelligence it will deliver (as much as I'm craving to see it), but rather the level of intelligence that free ChatGPT users will be able to access. ChatGPT has around 700m weekly active users, and the vast