ARC Grids as Token Sequences: Why VLMs Struggle to Process Them

Fundamentally, it's because ARC grids aren't images, so VLMs can't make sense of them. They're 2D grids of tokens. Some people process them with 2D-native transformers (2D position embeddings, or 2D attention), with good results, but a flattened sequence is actually a very
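A minimal sketch of the flattening this describes: the grid is serialized row-major into a 1D token sequence, while each cell's (row, col) coordinates are kept separately so a 2D position embedding could later consume them. The function name `flatten_grid` and the use of cell colors 0-9 directly as token ids are illustrative assumptions, not a reference implementation.

```python
def flatten_grid(grid):
    """Row-major flatten of an ARC grid.

    Returns the 1D token sequence plus the (row, col) coordinate of each
    token, which a 2D position embedding could consume instead of a single
    flat index.
    """
    tokens, positions = [], []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            tokens.append(cell)       # cell colors 0-9 serve as token ids (assumption)
            positions.append((r, c))  # 2D coordinates preserved for the embedding
    return tokens, positions


# Toy 2x2 grid:
grid = [[0, 1],
        [2, 3]]
tokens, positions = flatten_grid(grid)
# tokens    -> [0, 1, 2, 3]
# positions -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Note that the flat sequence alone loses column adjacency (cells vertically adjacent in the grid end up a full row-width apart), which is exactly what the 2D coordinates are meant to restore.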