2. Building evaluations. Many benchmarks get saturated quickly, and we need new ones to evaluate frontier language models. Beyond that, how to evaluate language models in general is still an open question. The new OpenAI evals library could be a good starting point: https://github.com/openai/evals
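As a rough illustration of what a simple evaluation looks like, here is a minimal sketch of an exact-match eval harness in plain Python. The sample fields (`input`, `ideal`), the `mock_model` stand-in, and the scoring logic are all assumptions for illustration, not the evals library's actual API:

```python
# Hypothetical eval samples: each pairs a prompt with a reference answer.
samples = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
]

def exact_match(completion: str, ideal: str) -> bool:
    """Score a completion against the reference, ignoring case/whitespace."""
    return completion.strip().lower() == ideal.strip().lower()

def run_eval(model_fn, samples):
    """Run model_fn over each sample and return accuracy."""
    correct = sum(exact_match(model_fn(s["input"]), s["ideal"]) for s in samples)
    return correct / len(samples)

# Stand-in model for demonstration; a real eval would query an actual LM.
mock_model = {"What is 2 + 2?": "4", "Capital of France?": "paris"}.get

print(run_eval(mock_model, samples))
```

Exact match is only the simplest grading scheme; open-ended tasks typically need fuzzy matching, rubric scoring, or model-graded evaluation instead.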