AI Dynamics

Global AI News Aggregator

@askalphaxiv

  • Test-Time Scaling Makes Overtraining Compute-Optimal

    Most scaling laws assume you train once and answer once. But this paper argues that if you already know you'll spend extra compute at test time by sampling many answers, you should train a different model. Instead of a bigger model trained the usual way, it can be better to train a smaller model for much longer. Because smaller models are cheaper to sample many times, those extra tries can beat one expensive shot from a larger model. So the real quantity to optimize is not training compute alone but training + inference together, and the paper shows that overtraining can actually become the compute-optimal choice.
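
    The train + inference tradeoff can be sketched with a back-of-envelope calculation. All numbers and constants below are illustrative assumptions (the common approximations train_flops ≈ 6ND and inference_flops ≈ 2N per token), not the paper's actual figures:

```python
# Illustrative back-of-envelope sketch (NOT the paper's numbers): compare
# total compute (train + inference) for a big model answering once versus a
# smaller, overtrained model sampling k answers per query.

def total_flops(n_params, n_train_tokens, n_queries, tokens_per_answer, samples_per_query):
    train = 6 * n_params * n_train_tokens            # ~6ND training FLOPs
    inference = 2 * n_params * tokens_per_answer * samples_per_query * n_queries
    return train + inference

QUERIES = 10**9   # assumed lifetime inference load
TOKENS = 500      # assumed tokens generated per answer

# Big model, ~20 tokens/param ("Chinchilla-style"), answers once per query.
big = total_flops(70e9, 1.4e12, QUERIES, TOKENS, samples_per_query=1)

# Small model, heavily overtrained (~200 tokens/param), samples 8 answers.
small = total_flops(7e9, 1.4e12, QUERIES, TOKENS, samples_per_query=8)

print(f"big model total:   {big:.3e} FLOPs")
print(f"small model total: {small:.3e} FLOPs")
# Once inference dominates the budget, the overtrained small model can come
# out far cheaper in total even though it samples many times per query.
```

    Under these assumed numbers the small overtrained model wins on total FLOPs; the crossover point depends entirely on the inference load.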

    → View original post on X — @askalphaxiv, 2026-04-06 18:08 UTC

  • Meta Harnesses: Automated Framework Optimization for AI Tasks

    Meta Harnesses is Autoresearch on steroids. Something I've been exploring recently is getting long-running agents to hill-climb on a verifiable task and continuously improve without my intervention. Karpathy's Autoresearch did this pretty well on specific tasks, but this weekend I tried Meta Harnesses, which moves one level of abstraction up.

    What does Meta Harness do? Autoresearch can be used in a harness like Claude Code / Codex to generate experiments to try, evaluate results, and keep looping. Meta Harness generates the harness itself, optimized for a task or a set of tasks. Here, a harness is defined as "a single-file Python program that modifies task-specific prompting, retrieval, memory, and orchestration logic". The idea is that LLMs are very powerful today, but to harness [pun intended] their power, you need to give them the right prompts and context. Meta Harnesses automates coming up with the right prompts and the right way to retrieve context to solve a problem.

    Where did this idea come from? It's from a paper out of Stanford by the author of DSPy, published last week. The paper shows fantastic performance on three tasks: text classification, math reasoning (IMO-level problems), and coding (Terminal Bench 2.0), far outperforming traditional harnesses. The discovered harnesses are interesting: the math one, for example, splits the logic into categories (Combinatorics, Geometry, Number Theory, Algebra) and prompts and handles context differently for each. The coding harness, among other things, pre-processes the tools available in the environment to save exploratory turns.

    When should you use it, and when not? Meta Harnesses seem pretty useful for tackling a specific but wide set of problems where the result is verifiable. In contrast, when I tried it on a single task like chess, it arbitrarily divided the problem into separate sub-tasks (opening, midgame, endgame) and created a different approach for each. This "works" but isn't really clean, because we believe there should be one approach that covers all three. It does far better on things like examinations (JEE, Gaokao), where splitting problems into categories and tackling each with a different strategy is natural. This paper covers a pretty light version of what a harness means. In the future, we could split tasks into harnesses with access to specific kinds of data, specific toolchains, and various models to get even better results. Overall, a pretty cool applied-AI approach to hill-climbing a verifiable task in a specific domain with variety within the problem space.
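
    The outer loop being described (propose a harness, score it on verifiable tasks, keep the best) can be sketched roughly as below. All names and the generate/evaluate interfaces are my own stand-ins, not the paper's API:

```python
# Hypothetical sketch of a meta-harness hill-climbing loop. generate_harness
# stands in for an LLM call that writes a new single-file harness; evaluate
# stands in for actually running it on verifiable tasks and scoring answers.

import random

def generate_harness(best_source, feedback):
    """Dummy stand-in for an LLM proposing a new harness program."""
    return {"source": f"# harness variant {random.randint(0, 10**6)}",
            "parent": best_source, "feedback": feedback}

def evaluate(harness, tasks):
    """Dummy stand-in: a real evaluator would execute the harness on each
    task and check the verifiable result. Here, a random score per task."""
    return sum(random.random() for _ in tasks) / len(tasks)

def meta_harness_loop(tasks, iterations=20):
    best, best_score, feedback = None, float("-inf"), ""
    for _ in range(iterations):
        candidate = generate_harness(best, feedback)
        score = evaluate(candidate, tasks)
        if score > best_score:          # hill-climb: keep only improvements
            best, best_score = candidate, score
        feedback = f"last score: {score:.3f}, best so far: {best_score:.3f}"
    return best, best_score

best, score = meta_harness_loop(tasks=[f"task-{i}" for i in range(5)])
print("best score:", round(score, 3))
```

    The real system presumably feeds much richer feedback (transcripts, failure cases) back into the generator; the point here is only the shape of the loop.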

    → View original post on X — @askalphaxiv, 2026-04-06 16:22 UTC

  • HISA: Hierarchical Indexing for Efficient Sparse Attention in LLMs

    "HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention" Sparse attention can still be slow. And the slow part is often not the attention step itself, but the search step that scans the whole context to find useful tokens. This paper's HISA makes that search cheaper. It first finds the best blocks, then finds the best tokens inside those blocks. This keeps token-level precision, needs no retraining, works with the same downstream attention, and gives up to 3.75x speedup while staying close to the original quality.
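
    The block-then-token search can be illustrated with a minimal two-stage top-k selection. The scoring scheme and sizes below are illustrative assumptions in the spirit of the summary, not the paper's actual method:

```python
# Minimal sketch of hierarchical (block-then-token) selection for sparse
# attention: rank blocks by a cheap per-block summary first, then do
# fine-grained token ranking only inside the winning blocks.

def hierarchical_topk(scores, block_size, top_blocks, top_tokens):
    """scores: per-token relevance scores for one query (e.g. q.k products).
    Returns the selected token indices, sorted."""
    blocks = [scores[i:i + block_size] for i in range(0, len(scores), block_size)]
    # Stage 1: coarse search, one summary score (here: max) per block.
    block_rank = sorted(range(len(blocks)), key=lambda b: max(blocks[b]), reverse=True)
    kept = block_rank[:top_blocks]
    # Stage 2: token-level search restricted to the kept blocks only.
    candidates = [(scores[i], i) for b in kept
                  for i in range(b * block_size, b * block_size + len(blocks[b]))]
    candidates.sort(reverse=True)
    return sorted(i for _, i in candidates[:top_tokens])

scores = [0.1, 0.2, 0.9, 0.1,   # block 0: contains one strong token
          0.0, 0.1, 0.1, 0.0,   # block 1: weak everywhere, never scanned
          0.8, 0.7, 0.1, 0.1]   # block 2: two strong tokens
print(hierarchical_topk(scores, block_size=4, top_blocks=2, top_tokens=3))
# → [2, 8, 9]
```

    The saving is that block 1 is rejected from one summary score instead of four token comparisons; at real context lengths that pruning is where the speedup comes from, while the final selection stays token-granular.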

    → View original post on X — @askalphaxiv, 2026-04-05 19:06 UTC

  • Read more: alphaxiv.org/abs/2603.28458

    → View original post on X — @askalphaxiv, 2026-04-05 19:06 UTC

  • Embarrassingly Simple Self-Distillation Improves Code Generation

    Most code-improvement methods need extra machinery: a stronger teacher, execution-based filtering, or RL. But this paper shows that a model can get much better at code generation by training on its own raw outputs, even though those outputs are not verified, not filtered for correctness, and not produced by a stronger teacher. What makes it even more surprising is that the gains are largest on harder problems, and simple decoding tricks alone cannot recover them. The most counterintuitive result is that it can still help even when the self-generated training data is partly garbage: the paper shows a high-temperature setting where many outputs become gibberish, yet the fine-tuned model still improves materially. That suggests the benefit is not mainly coming from learning correct programs, but from reshaping the model's token probabilities in a better way. Empirically, Qwen3-30B-Instruct improves from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with the biggest gains on harder problems.
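
    The "reshaping token probabilities" intuition can be made concrete with a toy categorical model. This is entirely my own construction, not the paper's method: refitting a distribution by maximum likelihood on its own temperature-scaled samples moves it toward the sampling distribution, so self-training changes the probabilities even with no external correctness signal:

```python
# Toy illustration (my own construction): a categorical "model" is refit by
# MLE on samples drawn from itself at a given temperature. The refit
# distribution tracks the temperature-scaled sampling distribution, showing
# that self-training on raw samples is a probability-reshaping operation.

import math, random
from collections import Counter

def sample(probs, temperature, n):
    # Draw n tokens from a temperature-scaled categorical distribution.
    logits = [math.log(p) / temperature for p in probs]
    z = sum(math.exp(l) for l in logits)
    return random.choices(range(len(probs)),
                          weights=[math.exp(l) / z for l in logits], k=n)

def refit(samples, vocab_size):
    # "Fine-tune" = maximum-likelihood refit on the model's own raw samples.
    counts = Counter(samples)
    return [counts[t] / len(samples) for t in range(vocab_size)]

random.seed(0)
base = [0.5, 0.3, 0.1, 0.1]                            # original distribution
own = sample(base, temperature=1.5, n=10_000)          # noisy self-generated data
new = refit(own, vocab_size=4)
print("base :", base)
print("refit:", [round(p, 2) for p in new])
```

    A real LLM fine-tune is of course far richer than an MLE refit of one categorical, so this only gestures at the mechanism the paper hypothesizes.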

    → View original post on X — @askalphaxiv, 2026-04-05 18:25 UTC

  • Read more: alphaxiv.org/abs/2604.01193

    → View original post on X — @askalphaxiv, 2026-04-05 18:25 UTC

  • Read more: alphaxiv.org/abs/2603.18886

    → View original post on X — @askalphaxiv, 2026-04-04 17:55 UTC

  • Principia: Mathematical Object Reasoning Benchmark for Frontier Models

    “Reasoning over Mathematical Objects” Most reasoning benchmarks still let models answer with multiple choice or short numerics, which makes evaluation easy but also makes the task easier than real STEM reasoning. This paper shows that when you remove the options and ask for the actual object, like an equation, matrix, set, interval, or piecewise function, performance drops sharply, even for frontier models. So the paper proposes Principia: a benchmark, training set, and verifier pipeline built specifically for mathematical-object reasoning, plus on-policy judge training to score these hard outputs reliably. What makes this interesting is that training on these harder outputs also improves standard math and science benchmarks, suggesting the gains reflect better actual reasoning, not just better formatting.
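
    A quick way to see why object-level answers need a verifier pipeline rather than string matching: the same mathematical object can be written many ways. The checker below is my own illustration for finite sets of rationals, not Principia's pipeline:

```python
# Hypothetical grader sketch (not Principia's): normalize a finite set of
# rationals before comparing, so notation differences don't fail the check.

from fractions import Fraction

def parse_set(text):
    """Parse a finite set like '{1, 3/2, -2}' into a frozenset of Fractions."""
    inner = text.strip().removeprefix("{").removesuffix("}")
    return frozenset(Fraction(part.strip()) for part in inner.split(",") if part.strip())

def same_object(answer, reference):
    # Equal as sets, regardless of element order or how each number is written.
    return parse_set(answer) == parse_set(reference)

# '0.5' vs '1/2' and different orderings denote the same set...
print(same_object("{1/2, 3, -2}", "{3, -2, 0.5}"))   # → True
# ...while a genuinely different set still fails.
print(same_object("{1/2, 3}", "{1/2, 4}"))           # → False
```

    Scaling this idea to equations, matrices, intervals, and piecewise functions is exactly why the paper needs a full verifier pipeline and trained judges rather than exact match.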

    → View original post on X — @askalphaxiv, 2026-04-04 17:55 UTC

  • Read more: alphaxiv.org/abs/2603.14389

    → View original post on X — @askalphaxiv, 2026-04-04 02:27 UTC

  • DGPO: Stable Soft-Clipping via Bilateral Decoupled Probability Gradient Decay

    "From log π to π: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight" This paper argues that RLVR should optimize probability gradients, not log-probability gradients. So it proposes DGPO, which applies decoupled decay at the clipping boundaries to preserve exploration while preventing gradient divergence. The result is a more stable soft-clipping objective that consistently outperforms GRPO and prior variants on mathematical reasoning benchmarks.
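
    The hard-clip vs soft-clip distinction can be sketched numerically. The exponential decay below is my own hypothetical stand-in for DGPO's schedule (the paper's bilateral decoupled decay is its own construction); the point is only the shape of the gradient weight:

```python
# Illustrative sketch (decay schedule is my assumption, not DGPO's): compare
# the per-token gradient weight under PPO/GRPO-style hard clipping, which
# zeroes the gradient outside the trust region, with a soft-clipping weight
# that decays smoothly past each boundary instead.

import math

EPS = 0.2  # clip range, as in PPO/GRPO

def hard_clip_weight(ratio):
    # Gradient flows only while the ratio stays inside [1 - EPS, 1 + EPS].
    return 1.0 if 1 - EPS <= ratio <= 1 + EPS else 0.0

def soft_clip_weight(ratio, decay=5.0):
    # Hypothetical bilateral decay: full weight inside the region, then an
    # exponential falloff on BOTH sides; the two rates could be decoupled.
    if ratio < 1 - EPS:
        return math.exp(-decay * ((1 - EPS) - ratio))
    if ratio > 1 + EPS:
        return math.exp(-decay * (ratio - (1 + EPS)))
    return 1.0

for r in (0.5, 0.9, 1.0, 1.3, 2.0):
    print(f"ratio={r:.1f}  hard={hard_clip_weight(r):.3f}  soft={soft_clip_weight(r):.3f}")
# The soft weight never hard-zeroes, so out-of-region tokens keep a small,
# bounded gradient (exploration) without the unbounded growth a raw ratio allows.
```

    The stability claim is that the decay keeps this weight bounded as the ratio drifts, which is where a naive soft clip can diverge.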

    → View original post on X — @askalphaxiv, 2026-04-04 02:27 UTC