@aymericroucher - AI Dynamics

Evaluating Gemini 2.5 Flash on ARC-AGI benchmarks

By

–

28 August 2025 15h05

I've tested if nano-banana / Gemini-2.5-flash-image beat ARC-AGI – it's quite far. Btw bravo to the ARC_AGI team, the delta between easiness of problems for humans vs difficulty for LLMs is just

→ View original post on X — @aymericroucher

28 August 2025

Claude Code’s unique prompt handling feature explained

By

@aymericroucher

–

25 August 2025 10h10

One cool thing that Claude Code has is the ability to type guidance as the model goes, and when you hit enter it's put in a wait stage and submitted to the prompt after the next tool call.
I didn't see this in Codex.

→ View original post on X — @aymericroucher

25 August 2025

GPT-5 Achieves New Benchmark on Training Info Efficiency

By

@aymericroucher

–

07 August 2025 19h19

GPT-5 just beat a new benchmark: the "model training info / total system card length" is now approaching 0 !

→ View original post on X — @aymericroucher

7 August 2025

Claude Code Front-End Usage and Comparison for Homepage Creation

By

@aymericroucher

–

06 August 2025 17h30

Claude Code for front-end is heavily underrated. Below are 3 attempts to make a homepage for an ongoing project:
Claude Code – Lovable – V0
In my opinion, Claude Code did best! Plus you have all files locally so you can start editing right away
And you have full control over

→ View original post on X — @aymericroucher

6 August 2025

OpenAI’s new open models outperform previous SOTA Qwen

By

@aymericroucher

–

05 August 2025 21h21

OpenAI's new open models are the new reference! Fewer parameters (both total and activated), yet much better performance than previous SOTA by Qwen EDIT: put the thinking scores for Qwen3, to be more fair!

→ View original post on X — @aymericroucher

5 August 2025

Comparison of Qwen and OpenAI Model Performance on GPQA Benchmarks

By

@aymericroucher

–

05 August 2025 21h06

I used GPQA for Qwen, not sure if it's the standard or the diamond set.
If Qwen uses the standard and not diamond, it's even more impressive to know that OpenAI models on the hardest set beat Qwen on the standard!

→ View original post on X — @aymericroucher

5 August 2025

Open Source Chatbots Narrow Gap with Closed-Source on ELO Ratings

By

@aymericroucher

–

30 July 2025 0h08

Slowly but steadily, open source is catching up! Here's the best Chatbot Arena ELO from Open vs Closed-weights models through time – even though Chatbot arena advantage the big closed-source labs that can train on the collected datapoints.

→ View original post on X — @aymericroucher

30 July 2025

Insights on Engineering State-of-the-Art AI Agents

By

@aymericroucher

–

19 July 2025 15h55

Great insights by Manus team on engineering SOTA agents:
– KV cache is king in agents – filesystem is a great mechanism for memory
– keep errors in the trace
– and more
Read this !

→ View original post on X — @aymericroucher

19 July 2025

Agent patching improves performance on multiple benchmarks

By

@aymericroucher

–

17 July 2025 19h21

Impressive benchmarks!: patching together agents really improves perf on many benchmarks, à la Manus or OpenHands-Versa

→ View original post on X — @aymericroucher

17 July 2025

Example of web research, Drive, Terminal slides, and Operator validation

By

@aymericroucher

–

17 July 2025 19h19

Example of cool capabilities unlocked by this : do research on the web and connect to Drive via Text browser, then create slides via Terminal, then view them through Operator to validate slides or go back to the drawing board.

→ View original post on X — @aymericroucher

17 July 2025