Wow, this has just become my favorite LLM test. I hadn't realized that this doesn't work, but it really doesn't, even for SOTA LLMs. It seems to be a bit hit and miss: GPT-4o failed 1 out of 3 times, while Claude failed 3 out of 3 times.
New favorite LLM test reveals inconsistent performance across SOTA models