Paper does test multi-turn (24 tool calls in OfficeQA, 30 turns in SpreadsheetBench, 50 steps in ALFWorld), but mid-session auto-compaction isn't part of the eval. Skill being under 2K tokens probably helps it survive compaction, but not validated against that failure mode.
Multi-Turn Agent Evaluation and Context Compaction Limits
By
–