A quick comparison b/w Claude 3 and GPT-4 on a spatial reasoning task (n=100, 5-run average w/ temp=1.0). Seems like Claude 3 still beats GPT-4, and gpt-4-turbo performs worse than gpt-4-0613. Interesting contrast to their perf in chat & coding, where GPT-4 comes out ahead. 1/n
