Fair, it's all definitions 🙂 But evals in the sense most product builders think of them (static offline envs/test sets you run an agent version against) tend not to correlate super well with real-world success.