(3) Even the existing BIG-Bench metrics are quite understudied, IMO. BIG-Bench contains hundreds of tasks, each task has dozens of models evaluated on it, and each evaluation reports many metrics. Task logs are also available for some models. This raises natural questions:
BIG-Bench metrics deserve deeper analysis and study