AI Dynamics

Global AI News Aggregator

Interleaving Tokens and Training Data for Vision Models

These are good questions! Regarding part 1, yes, interleaving text and image tokens makes a lot of sense then. For part 2, I'm not sure, but I suspect you'd need to train the model on data that contains bounding box coordinates at the very least.

→ View original post on X — @rasbt,
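
As an illustration only (not taken from the original post), here is a minimal sketch of the two ideas mentioned above: interleaving text and image tokens in one sequence, and writing bounding box coordinates as tokens so they can appear in training data. The special tokens (`<img>`, `<patch_i>`, `<loc_i>`), the 1000-bin quantization, and both helper functions are hypothetical; real vision-language models define their own vocabularies and coordinate formats.

```python
# Hypothetical special tokens; real models define their own vocabulary.
IMG_START, IMG_END = "<img>", "</img>"
NUM_BINS = 1000  # assumed quantization resolution for box coordinates


def interleave(segments):
    """Flatten (kind, payload) segments into a single token sequence.

    "text" payloads are whitespace-split strings (a stand-in for a real
    tokenizer); "image" payloads are lists of patch indices.
    """
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(payload.split())
        elif kind == "image":
            tokens.append(IMG_START)
            tokens.extend(f"<patch_{i}>" for i in payload)
            tokens.append(IMG_END)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return tokens


def box_to_tokens(box, width, height):
    """Quantize a pixel-space box (x0, y0, x1, y1) into location tokens."""
    x0, y0, x1, y1 = box
    scaled = (x0 / width, y0 / height, x1 / width, y1 / height)
    # Clamp so that a coordinate on the far edge maps to the last bin.
    return [f"<loc_{min(int(v * NUM_BINS), NUM_BINS - 1)}>" for v in scaled]


if __name__ == "__main__":
    sequence = interleave([
        ("text", "Where is the cat?"),
        ("image", [0, 1, 2, 3]),
        ("text", "The cat is at"),
    ])
    # Append the box as extra tokens, as a training target might.
    sequence += box_to_tokens((32, 48, 160, 200), width=224, height=224)
    print(sequence)
```

A training example built this way would pair an image with text that contains the location tokens, which is one plausible way to provide the bounding box supervision the reply suspects is needed.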