Often, evals are badly disconnected from actual utility. For example, for a while we had an eval that measured 'writing style': basically, how well do we prevent AI slop in writing output? We maxed out the eval, put the model in prod, and users hated it.