This is the correct take. Evals are helpful but not well-correlated with actual utility. At Otherside, we use A/B tests grounded in real-world traffic, measured against subscriptions and retention. We've tried it all. This is the way.
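To make the comparison concrete, here is a minimal sketch of how one arm of such an A/B test might be scored on subscription conversion. It is not Otherside's actual pipeline: the function name `two_proportion_z_test`, the choice of a pooled two-proportion z-test, and all traffic numbers are assumptions for illustration only.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variant B against control A.

    Hypothetical helper: returns (absolute lift, z statistic, two-sided p-value)
    using a pooled two-proportion z-test (normal approximation).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one user")

    p_a = conv_a / n_a
    p_b = conv_b / n_b

    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if se == 0:
        # All users converted or none did; the arms are indistinguishable.
        return p_b - p_a, 0.0, 1.0

    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


if __name__ == "__main__":
    # Made-up counts: subscribers out of users exposed to each model variant.
    lift, z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"lift={lift:.3f} z={z:.2f} p={p:.4f}")
```

The same comparison could be run on a retention metric (for example, users still active after a fixed window) by swapping in those counts. A real setup would also need a sample size and stopping rule fixed in advance, which this sketch leaves out.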
Real-world A/B testing better than evals for AI utility