An AI can tell you there's a cat in the image. Pointing to the exact pixels is the hard part.
Satya Mallick (@LearnOpenCV), June 2, 2026
The reason it's slow: most VLMs spell out a bounding box one coordinate token at a time — some even split "1024" into single digits. But a box's corners are connected. Decode them… pic.twitter.com/eoJu0PiHGU
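To make the token-count argument concrete, here is a minimal sketch comparing how many decoding steps one box costs under two output schemes. Both schemes, the `<loc_N>` token format, and the box values are illustrative assumptions, not any particular model's vocabulary.

```python
def digit_tokens(box):
    """Per-digit scheme: one token per digit, one separator between coordinates."""
    tokens = []
    for i, value in enumerate(box):
        if i > 0:
            tokens.append(",")
        tokens.extend(str(value))  # "1024" becomes "1", "0", "2", "4"
    return tokens


def coordinate_tokens(box):
    """Per-coordinate scheme: one token per whole coordinate (hypothetical <loc_N>)."""
    return [f"<loc_{value}>" for value in box]


box = (12, 348, 1024, 768)  # x1, y1, x2, y2 in pixels; made-up values
per_digit = digit_tokens(box)
per_coord = coordinate_tokens(box)

# An autoregressive decoder runs one forward pass per emitted token,
# so the token count is a rough proxy for decoding latency.
print(len(per_digit), len(per_coord))  # 15 4
```

Under these assumptions the per-digit scheme needs 15 sequential steps for a single box and the per-coordinate scheme needs 4, and both still emit the corners one after another.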