Complex Reasoning Error Modes in AI Model Evaluations