Users don't give a shit how well you do on evals. They care how well your product solves their problems. You can often get evals to *somewhat* correlate with user problems, but coverage is usually mid at best, and it's a moving target. We've seen far better success with A/B tests against revenue.
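To make "A/B tests against revenue" concrete, here is a minimal sketch, not a description of how anyone's tests were actually run. It assumes you log revenue per user for a control arm and a treatment arm (say, the old model and the new one), then compares mean revenue per user with a permutation test. The function names, the sample data, and the choice of a permutation test are all illustrative assumptions.

```python
import random
from statistics import mean


def revenue_lift(control, treatment):
    """Difference in mean revenue per user (treatment minus control)."""
    return mean(treatment) - mean(control)


def permutation_p_value(control, treatment, n_permutations=5_000, seed=0):
    """Two-sided permutation test on the difference in mean revenue per user.

    Repeatedly shuffles users across the two arms and counts how often the
    shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(revenue_lift(control, treatment))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical revenue per user, in dollars, for each arm.
    control = [0.0, 0.0, 4.99, 0.0, 9.99, 0.0, 0.0, 4.99]
    treatment = [0.0, 4.99, 4.99, 9.99, 0.0, 9.99, 4.99, 0.0]
    print("lift per user:", revenue_lift(control, treatment))
    print("p-value:", permutation_p_value(control, treatment))
```

With samples this small the result is mostly noise; a real test needs enough users per arm for the lift to be distinguishable from chance. The point is only that the metric being compared is money users actually spent, not a benchmark score.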
User satisfaction matters more than AI evaluation benchmarks