AI Dynamics

Global AI News Aggregator

@dair_ai

  • Natural-Language Agent Harnesses: Making AI Agent Control Portable and Inspectable

    Agent harnesses are too restrictive, and that's because they're still designed as code. What if the harness itself were written in natural language and interpreted by an LLM at runtime? This research explores that idea.

    The work introduces Natural-Language Agent Harnesses (NLAHs), a structured natural-language representation that externalizes harness logic as a portable, executable artifact. Instead of scattering control flow across controller code, framework defaults, and tool adapters, NLAHs make contracts, roles, stage structure, state semantics, and failure taxonomies explicit and editable. An Intelligent Harness Runtime (IHR) places an LLM inside the runtime loop to interpret and execute these harnesses directly.

    Why does it matter? Harness design is increasingly decisive for agent performance, but it's buried in code that's hard to transfer, compare, or ablate. NLAHs make the orchestration layer a first-class scientific object. The practical implication: harnesses become portable across runtimes, composable across tasks, and directly inspectable by humans and models alike.

    Paper: arxiv.org/abs/2603.25723

    Learn to build effective AI agents in our academy: academy.dair.ai/
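    A minimal sketch of the idea, under loud assumptions: the spec schema, stage names, and the `run_harness`/`interpret` functions below are illustrative, not the paper's actual NLAH format or IHR implementation.

```python
# Hypothetical NLAH: harness logic lives in an editable natural-language
# artifact, and an IHR-style loop hands each stage to an LLM interpreter.

NLAH_SPEC = """
role: research assistant
contract: answer only from retrieved documents; cite sources
stages:
  1. plan: break the user question into sub-queries
  2. retrieve: fetch documents for each sub-query
  3. synthesize: draft an answer grounded in the retrieved text
failure_taxonomy:
  - no_documents_found: report the gap instead of guessing
"""

def run_harness(spec: str, interpret) -> list[str]:
    """Runtime loop: execute each natural-language stage in order.
    `interpret` stands in for the LLM that a real IHR would call."""
    stages = [line.split(". ", 1)[1] for line in spec.splitlines()
              if line.strip() and line.strip()[0].isdigit()]
    outputs = []
    state = {"spec": spec}  # contracts and failure modes stay visible to the LLM
    for stage in stages:
        outputs.append(interpret(stage, state))
    return outputs

# Stub interpreter standing in for the LLM:
results = run_harness(NLAH_SPEC, lambda stage, state: f"did: {stage}")
```

    Because the harness is plain text, swapping a stage or tightening the contract is an edit to the artifact, not a code change, which is what makes it portable and inspectable.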

    → View original post on X — @dair_ai, 2026-03-31 13:14 UTC

  • Meta-Harness: Automated System Achieves 6x Performance Improvement

    NEW Stanford & MIT paper on model harnesses. Changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. What if we automated harness engineering itself?

    The work introduces Meta-Harness, an agentic system that searches over harness code by exposing the full history through a filesystem. The proposer reads source code, execution traces, and scores from all prior candidates, referencing over 20 past attempts per step.

    Results:
    – On text classification, it improves over SOTA context management by 7.7 points while using 4x fewer tokens.
    – On agentic coding, it outperforms all hand-engineered baselines on TerminalBench-2, scoring 37.6% versus Claude Code's 27.5%.

    Why it's a big deal: the harness around a model often matters as much as the model itself. Meta-Harness shows that giving an optimizer rich access to prior experience, not just compressed scores, unlocks automated engineering that beats human-designed scaffolding.

    Paper: arxiv.org/abs/2603.28052
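    A toy sketch of the filesystem-as-history idea (the directory layout and `propose_next` function are assumptions for illustration, not the paper's code): each prior candidate persists its full artifacts, and the proposer reads them all rather than just a compressed score history.

```python
import json
import pathlib
import tempfile

def propose_next(history_dir: pathlib.Path) -> dict:
    """Proposer step: load every prior candidate's source, trace, and score,
    then seed the next attempt from the best scorer. (A real proposer LLM
    would rewrite the code; here we just return what it would condition on.)"""
    candidates = []
    for cand in sorted(history_dir.iterdir()):
        candidates.append({
            "code": (cand / "harness.py").read_text(),
            "trace": (cand / "trace.log").read_text(),
            "score": json.loads((cand / "score.json").read_text())["score"],
        })
    best = max(candidates, key=lambda c: c["score"])
    return {"parent_code": best["code"], "seen": len(candidates)}

# Populate a toy history of three prior attempts:
root = pathlib.Path(tempfile.mkdtemp())
for i, score in enumerate([0.2, 0.6, 0.4]):
    d = root / f"attempt_{i}"
    d.mkdir()
    (d / "harness.py").write_text(f"# harness v{i}\n")
    (d / "trace.log").write_text("ran 10 episodes\n")
    (d / "score.json").write_text(json.dumps({"score": score}))

nxt = propose_next(root)  # seeds from attempt_1, the best scorer
```

    The design point is that traces and source survive on disk, so the proposer can diagnose *why* an attempt scored 0.2, not just that it did.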

    → View original post on X — @dair_ai, 2026-03-31 13:13 UTC

  • Coding Agents Excel at Processing Massive Long-Context Documents

    We are just scratching the surface of what's possible with coding agents. LLMs struggle with long contexts, even the ones that support massive context windows. It turns out coding agents already know how to solve this; you just need to reframe the problem.

    This work places massive text corpora into directory structures and lets off-the-shelf coding agents (Codex, Claude Code) navigate them with terminal commands and Python scripts. That means you are neither feeding massive text directly into a model's context window nor relying on semantic retrieval.

    Results:
    – On BrowseComp-Plus (750M tokens), this approach scores 88.5% vs the best published 80%.
    – On Oolong-Real (385K tokens), 33.7% vs 24.1%, a 56% relative improvement.
    – A GPT-5 full-context baseline manages only 20% on BrowseComp-Plus.
    – The approach works on corpora up to 3 trillion tokens.

    Instead of scaling context windows or building retrieval pipelines, coding agents that already know how to navigate file systems can process virtually unlimited context. The agents autonomously develop task-specific strategies: writing scripts, iteratively refining queries, and aggregating results programmatically.

    Paper: arxiv.org/abs/2603.20432
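    The setup can be sketched in a few lines; the layout and the `shard_corpus`/`grep` helpers below are assumptions for illustration, standing in for the terminal commands a real coding agent would issue over a much larger tree.

```python
import pathlib
import tempfile

def shard_corpus(docs: dict[str, str], root: pathlib.Path) -> None:
    """One file per document; a real setup might also split and nest by topic."""
    for name, text in docs.items():
        (root / f"{name}.txt").write_text(text)

def grep(root: pathlib.Path, needle: str) -> list[str]:
    """Stands in for the agent running `grep -l` over the corpus tree:
    only matching files are opened, so context usage stays tiny."""
    return [p.name for p in sorted(root.glob("*.txt"))
            if needle in p.read_text()]

root = pathlib.Path(tempfile.mkdtemp())
shard_corpus({"doc_a": "the treaty was signed in 1648",
              "doc_b": "unrelated shipping manifest"}, root)
hits = grep(root, "1648")  # agent narrows to doc_a without reading doc_b
```

    The corpus size is bounded by disk, not by the context window, which is why the same trick scales from 385K tokens to trillions.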

    → View original post on X — @dair_ai, 2026-03-30 15:12 UTC

  • CAID: Multi-Agent Asynchronous Coordination for Software Engineering

    NEW research from CMU (bookmark this one). The biggest unlock in coding agents is understanding how to run them asynchronously. Simply giving a single agent more iterations helps, but does not scale well, and multi-agent research shows that coordination > compute. This paper demonstrates it with a practical multi-agent system.

    CAID (Centralized Asynchronous Isolated Delegation) borrows proven human SWE practices: a manager builds a dependency graph and delegates tasks to engineer agents, which work in isolated git worktrees, execute concurrently, self-verify with tests, and integrate via git merge.

    CAID improves accuracy over single-agent baselines by 26.7% absolute on paper-reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). The key insight is that isolation plus explicit integration beats both single-agent scaling and naive multi-agent approaches. For long-horizon software engineering tasks, multi-agent coordination using git-native primitives should be the default strategy, not a fallback.

    Paper: arxiv.org/abs/2603.21489
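    The coordination pattern can be sketched with a plain topological sort (an illustrative toy, not the paper's system): the manager releases whichever tasks have all dependencies met, those run in parallel in isolation, and results integrate in dependency order.

```python
from graphlib import TopologicalSorter

def run_caid(tasks: dict[str, set[str]], engineer) -> list[str]:
    """`tasks` maps each task to its set of prerequisite tasks.
    The manager delegates every ready task; in real CAID each would run
    concurrently in its own git worktree and merge back when verified."""
    ts = TopologicalSorter(tasks)
    ts.prepare()
    merged = []
    while ts.is_active():
        for task in ts.get_ready():   # independent tasks: run concurrently
            engineer(task)            # engineer agent self-verifies with tests
            ts.done(task)
            merged.append(task)       # integration step (git merge in CAID)
    return merged

order = run_caid({"api": set(), "db": set(), "ui": {"api", "db"}},
                 engineer=lambda t: None)  # "ui" merges only after api and db
```

    The isolation comes from the worktrees (each agent edits its own checkout), and the explicit integration comes from merging only verified, dependency-ordered work.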

    → View original post on X — @dair_ai, 2026-03-30 14:42 UTC

  • Reasoning Models: Why Listed Prices Don’t Match Actual Costs

    The model you think is cheaper might actually cost you more. New research quantifies exactly how misleading listed API prices are. Across 8 frontier reasoning models and 9 tasks, 21.8% of model-pair comparisons exhibit pricing reversal, where the cheaper-listed model costs more in practice. The magnitude reaches up to 28x.

    Examples:
    – Gemini 3 Flash is listed 78% cheaper than GPT-5.2, yet its actual cost is 22% higher.
    – Claude Opus 4.6 is listed at 2x Gemini 3.1 Pro but actually costs 35% less.

    The root cause is thinking-token heterogeneity: on the same query, one model may use 900% more thinking tokens than another. Why does it matter? Anyone choosing reasoning models for production needs to benchmark actual costs, not listed prices. Removing thinking-token costs reduces ranking reversals by 70%. The authors release code and data for per-task cost auditing.

    Paper: arxiv.org/abs/2603.23971
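    The arithmetic behind a reversal fits in a few lines. The prices and token counts below are made up to illustrate the mechanism, not the paper's measurements: a model listed 4x cheaper per token still ends up costlier once its heavier thinking-token usage is billed.

```python
def actual_cost(price_per_mtok: float, output_tokens: int,
                thinking_tokens: int) -> float:
    """Billed output = visible answer tokens + hidden reasoning tokens."""
    return price_per_mtok * (output_tokens + thinking_tokens) / 1_000_000

# Listed-cheap model that reasons verbosely on this query:
cheap_listed = actual_cost(price_per_mtok=1.0, output_tokens=500,
                           thinking_tokens=20_000)   # $0.0205
# Listed-pricey model that reasons tersely:
pricey_listed = actual_cost(price_per_mtok=4.0, output_tokens=500,
                            thinking_tokens=2_000)   # $0.0100

reversal = cheap_listed > pricey_listed  # the "cheaper" model cost 2x more
```

    This is why the paper's advice is to audit per-task actual cost: the reversal depends on how many thinking tokens each model spends on *your* queries, which the price sheet doesn't show.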

    → View original post on X — @dair_ai, 2026-03-29 15:07 UTC