@petergostev - AI Dynamics

OPQA Benchmark: 20 Real Engineering Bottlenecks from OpenAI

By @petergostev – 30 May 2026 17:55

OpenAI-Proof Q&A (OPQA) is a benchmark of 20 real research and engineering bottlenecks that OpenAI teams encountered internally, each taking more than a day to solve. A model is given relevant code, logs, and experiment artifacts, then asked to identify and explain the root

→ View original post on X — @petergostev

OpenAI benchmark scores stagnant since launch

By @petergostev – 30 May 2026 17:55

OpenAI has this interesting benchmark built from its own real engineering bottlenecks, where the scores have not moved since launch over a year ago. Some earlier models even scored better than 5.5. I wonder what's going on here.

Opinion on DeepSeek R1, o1, and suspicious o3 benchmarks

By @petergostev – 30 May 2026 14:52

My personal vibes-based opinion on this gap: at DeepSeek R1 level I believe this was real, since o1 and R1 were not that far apart. From o3 onwards, I think there's something fishy going on with the benchmarks. Open models are not bad and certainly getting better, but the utility

Model Release Cycles: Anthropic & OpenAI Speed Advantage

By @petergostev – 29 May 2026 15:22

The model release cycles from Anthropic & OpenAI are genuinely insane: previously we had 6-12 months between updates, now models are released every 1.5 months. This is an underappreciated reason why Anthropic & OpenAI are in the lead. Google's releases are not as fast,

Opus 4.8 tops BullshitBench after 4.7 dip, needs harder questions

By @petergostev – 29 May 2026 11:38

Top-notch result from Opus 4.8 on BullshitBench, after a slight dip with 4.7. Need to start thinking of some new, harder questions soon!

Anthropic Acceleration: Model Release Cycle Shortening Trend

By @petergostev – 28 May 2026 23:47

Anthropic released the next version sooner than I thought. The trend is accelerating: from 50-70 days between releases before, down to 42 days since Opus 4.7.

Devin AI Productivity Metrics: Calibrating Capability Claims

By @petergostev – 28 May 2026 14:01

Explanation & chat link: "I digitized the curve from the image and used Cognition’s “>10x since start of 2026” claim to calibrate the relative shape. Then I anchored the absolute scale using the public “~1.1M PRs shipped with Devin” figure by Feb 2026: the chart area up to then
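
The anchoring step described in the quote can be sketched roughly as follows. This is a minimal illustration, not the calculation actually used: it assumes the digitized curve is a list of relative rates per period and that the area under the curve up to the anchor date corresponds to the cumulative "~1.1M PRs" figure (the quote is cut off before it confirms this). The function name `anchor_scale` and the sample values are hypothetical.

```python
def anchor_scale(relative_rates, anchor_index, anchor_total):
    """Rescale a relative curve so the area up to anchor_index equals anchor_total.

    relative_rates: digitized values in arbitrary units, one per period.
    anchor_index:   index of the last period covered by the known total.
    anchor_total:   known cumulative figure at that point.
    """
    area = sum(relative_rates[: anchor_index + 1])
    if area <= 0:
        raise ValueError("relative area up to the anchor must be positive")
    k = anchor_total / area  # units of output per relative unit
    return [r * k for r in relative_rates]


# Hypothetical digitized values, NOT read from the actual chart.
relative = [1.0, 2.0, 3.0, 4.0, 10.0]
scaled = anchor_scale(relative, anchor_index=3, anchor_total=1_100_000)
```

Scaling by a single constant keeps the relative shape of the curve intact, so a ratio such as the ">10x" claim is unaffected by the choice of anchor.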

AI Agents as Platform Infrastructure and Service Layer

By @petergostev – 24 May 2026 01:12

Agent as a platform could become a real thing. You will pay for your agent, then use that inference to access other services (e.g. shopping, finance, legal). Your incentives are aligned (if it's important, I want to spend a lot) and 3rd parties don't need to build their own agents

AI model quality and price evolution over time

By @petergostev – 21 May 2026 23:32

This is elite data – how the Pareto frontier moved over time. Took a lot of effort to get right. Huge shift in the model quality & price in the last 3 years. https://t.co/CtafQ4Gd66
— Peter Gostev (@petergostev) 21 May 2026
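
For readers unfamiliar with the term, a Pareto frontier over price and quality is the set of models that no other model beats on both axes at once. A minimal sketch of how such a frontier can be computed is below; it is not the method behind the linked chart, and the `Model` type, the units, and the sample entries are all hypothetical placeholders.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    price: float    # assumed unit: cost per fixed amount of usage, lower is better
    quality: float  # assumed unit: some benchmark score, higher is better


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models not dominated on (lower price, higher quality)."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal price, best quality first.
    for m in sorted(models, key=lambda m: (m.price, -m.quality)):
        # A model stays only if it beats every cheaper-or-equal model on quality.
        if m.quality > best_quality:
            frontier.append(m)
            best_quality = m.quality
    return frontier


# Made-up entries for illustration only.
models = [
    Model("A", 10.0, 50.0),
    Model("B", 5.0, 40.0),
    Model("C", 8.0, 35.0),
    Model("D", 20.0, 70.0),
    Model("E", 20.0, 60.0),
]
frontier_names = [m.name for m in pareto_frontier(models)]
```

Running this on a snapshot of models per date and comparing the resulting frontiers is one way to see how the trade-off shifts over time.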

Gemini 3.5 Flash: strengths and inconsistencies observed

By @petergostev – 20 May 2026 09:07

I did a video on Gemini 3.5 Flash. It is a pretty weird release; I went through dozens of examples and comparisons to other models. Some thoughts:
– It does WAY more than what you asked for
– It sometimes generates best-in-class stuff
– But sometimes crashes out and does
