“Screening Is Enough”: Transformers struggle with long context because softmax attention assigns only relative relevance, so irrelevant tokens still receive weight and useful ones get diluted as the context grows. The paper proposes Multiscreen, which judges each key independently, drops the irrelevant ones, and aggregates only what actually matters. This gives the model an absolute notion of relevance, including the ability to conclude "nothing here is useful", which standard attention cannot express. Empirically, this yields roughly 40% fewer parameters at similar loss, much stronger long-context retrieval, and 2.3–3.2x faster inference at 100K context.
→ View original post on X — @askalphaxiv, 2026-04-03 18:23 UTC
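The screening idea described above can be sketched in code. This is a minimal illustration, not the paper's actual Multiscreen mechanism: it assumes "judging each key independently" means a per-key sigmoid gate on the query-key score, with a hypothetical threshold `tau` deciding which keys survive. The function name and all details are illustrative.

```python
import math

def screened_attention(q, keys, values, tau=0.5):
    """Illustrative sketch of 'screening' attention (not the paper's exact method).

    Unlike softmax, which only distributes relative weight, each key gets an
    independent sigmoid gate in [0, 1]; keys gated below `tau` are dropped
    entirely, and only the survivors are aggregated.
    """
    d = len(q)
    gates = []
    for k in keys:
        score = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        gates.append(1.0 / (1.0 + math.exp(-score)))  # independent judgment per key

    kept = [(g, v) for g, v in zip(gates, values) if g > tau]
    if not kept:
        # "Nothing here is useful": return a zero vector, something a
        # softmax (which always sums to 1) cannot express.
        return [0.0] * len(values[0])

    total = sum(g for g, _ in kept)
    out = [0.0] * len(values[0])
    for g, v in kept:
        w = g / total  # normalize only over the surviving keys
        for i, vi in enumerate(v):
            out[i] += w * vi
    return out
```

With one clearly relevant and one clearly irrelevant key, only the relevant key's value survives; with no relevant keys at all, the output is the zero vector rather than a forced mixture.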
