On evaluating multi-step scientific tool use in LLM agents. SciAgentGym provides an interactive environment with 1,780 specialized tools across 4 scientific disciplines. The core finding: even advanced models like GPT-5 see success rates drop sharply from 60.6% to 30.9% as
LLM Agents Struggle With Multi-Step Scientific Tool Use
