AI Dynamics

Global AI News Aggregator

NVIDIA LocateAnything predicts entire box as atomic unit

Most VLMs predict bounding boxes one token at a time: X1, then Y1, then X2, then Y2.
But a box isn't text. It's geometry.
NVIDIA's LocateAnything predicts the entire box as one atomic unit. The claim: Parallel Box Decoding beats next-token prediction for spatial outputs.
(Part 1) Breakdown
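To make the contrast concrete, here is a toy sketch of the two decoding styles. It is not NVIDIA's implementation: the function names (`decode_sequential`, `decode_parallel`, `toy_next_coord`, `toy_box_head`) and the fixed `TARGET` box are hypothetical stand-ins, and the only thing it illustrates is how many decoding steps each style needs for one box.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def decode_sequential(next_coord: Callable[[List[float]], float]) -> Tuple[Box, int]:
    """Next-token style: one step per coordinate, each conditioned on the prefix."""
    coords: List[float] = []
    steps = 0
    for _ in range(4):  # x1, y1, x2, y2 in order
        coords.append(next_coord(coords))
        steps += 1
    return (coords[0], coords[1], coords[2], coords[3]), steps


def decode_parallel(box_head: Callable[[], Box]) -> Tuple[Box, int]:
    """Atomic style: a single step emits all four coordinates together."""
    return box_head(), 1


# Hypothetical stand-ins for a trained model; they just return a fixed box.
TARGET: Box = (0.10, 0.20, 0.55, 0.80)


def toy_next_coord(prefix: List[float]) -> float:
    # The next coordinate depends on how many have already been emitted.
    return TARGET[len(prefix)]


def toy_box_head() -> Box:
    # All four coordinates come out of one call.
    return TARGET


seq_box, seq_steps = decode_sequential(toy_next_coord)
par_box, par_steps = decode_parallel(toy_box_head)
print(seq_box, seq_steps)
print(par_box, par_steps)
```

Both paths return the same box here; the difference is that the sequential path takes four dependent steps while the parallel path takes one. Whether that translates into better accuracy or speed in a real model is the post's claim, not something this sketch demonstrates.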

→ View original post on X — @learnopencv