Which models would you recommend for longer-context tool calling? Are there any benchmarks for it that you find credible? I've not found a local model with tool calling good enough for me to trust with Claude Code or Codex, but I may not have been looking at the right options.
Evaluating LLMs for Long-Context Tool Calling and Agentic Reliability