ToolComp is a new benchmark that covers a broader range of tool-use scenarios than prior benchmarks. Uniquely, we additionally evaluate models using process supervision labels. We also split the benchmark into Enterprise and Chat use cases to differentiate
ToolComp: New Tool Use Benchmark with Process Supervision
