Most VLMs predict bounding boxes one token at a time: X1, Y1, X2, Y2.
But a box isn't text. It's geometry.
NVIDIA's LocateAnything predicts the entire box as one atomic unit. Parallel Box Decoding > next-token prediction for spatial outputs.
(Part 1 🧵) Breakdown 👇… pic.twitter.com/mQ83uMqtdv
Satya Mallick (@LearnOpenCV), June 5, 2026
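To make the contrast concrete, here is a toy sketch of the two decoding styles. It is not LocateAnything's implementation: `ToyModel`, its fixed `WEIGHTS`, and both decode functions are hypothetical stand-ins, and the only thing the sketch measures is how many forward passes each style needs to emit one box.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical fixed weights: one row per box coordinate (not from any real model).
WEIGHTS: List[List[float]] = [
    [0.1, 0.0, 0.2],  # x1
    [0.0, 0.3, 0.1],  # y1
    [0.5, 0.1, 0.2],  # x2
    [0.2, 0.6, 0.1],  # y2
]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyModel:
    """Stand-in for a VLM head that counts how many forward passes it runs."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def next_coordinate(self, features: Sequence[float], prefix: Sequence[float]) -> float:
        # One autoregressive step. A real model would also condition on the
        # values in `prefix`; this toy only uses its length to pick the row.
        self.forward_passes += 1
        return dot(WEIGHTS[len(prefix)], features)

    def whole_box(self, features: Sequence[float]) -> Box:
        # One step that emits all four coordinates together.
        self.forward_passes += 1
        x1, y1, x2, y2 = (dot(row, features) for row in WEIGHTS)
        return (x1, y1, x2, y2)


def decode_sequential(model: ToyModel, features: Sequence[float]) -> Box:
    """Next-token style: X1, then Y1, then X2, then Y2, one pass each."""
    coords: List[float] = []
    for _ in range(4):
        coords.append(model.next_coordinate(features, coords))
    return (coords[0], coords[1], coords[2], coords[3])


def decode_parallel(model: ToyModel, features: Sequence[float]) -> Box:
    """Box-as-a-unit style: a single pass yields the full box."""
    return model.whole_box(features)


features = [1.0, 2.0, 3.0]  # made-up image/query features
seq_model, par_model = ToyModel(), ToyModel()
seq_box = decode_sequential(seq_model, features)
par_box = decode_parallel(par_model, features)
print(seq_model.forward_passes, par_model.forward_passes)  # prints "4 1"
```

In this toy both paths return the same box, so the only difference is step count: four sequential passes versus one. Whether the real model gains accuracy as well as speed is a claim of the thread, not something this sketch demonstrates.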