Evaluation Crisis: MMLU Obsolete, Need Better AI Metrics

AI Dynamics

Global AI News Aggregator

02 March 2025 19h29

My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years, but that's long over.
I really like SWE-Bench Verified (real, practical, verified problems), and it is great, but it is itself too narrow.

Source: original post on X by @karpathy, 2 March 2025

