Oh yeah. I know the {0,1,N}-shot tasks in LM harness and in the PaLM/GPT-3 evals are very similar, modulo some prompting differences. I don't exactly mean to say palm-evals are better than that. I was just referring to the academic tasks in general (not specifically LM harness). Im
Language Model Evaluation Tasks and Benchmarking Methodologies Comparison