Another example of a persistent problem with LLMs: they do very well on standard medical questions, but when the right answer is replaced with “none of the above”, performance drops. More recent models generally show smaller drops. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2837372
…
LLM Performance Drops on Non-Standard Medical Questions
