These are good questions! Regarding 1: yes, interleaving text and image tokens makes a lot of sense in that case. For part 2, I'm not sure, but I suspect that, at the very least, you'd need to train the model on data that contains bounding box coordinates.
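To make part 2 a bit more concrete, here is a rough sketch of what one training sample with bounding box coordinates might look like once it is serialized as text. Everything here is an assumption for illustration: the `<image>` placeholder, the `<box>` tags, the 1000-bin quantization, and the helper names are all hypothetical and not taken from any particular model or library.

```python
from dataclasses import dataclass

IMAGE_TOKEN = "<image>"  # hypothetical placeholder for where image tokens get spliced in
NUM_BINS = 1000          # assumed number of coordinate bins per axis


@dataclass
class Box:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return max(0, min(bin_index, num_bins - 1))


def box_to_text(box: Box, width: int, height: int) -> str:
    """Serialize one box as 'label <box>x0 y0 x1 y1</box>' using binned coordinates."""
    coords = [
        quantize(box.x_min, width),
        quantize(box.y_min, height),
        quantize(box.x_max, width),
        quantize(box.y_max, height),
    ]
    return f"{box.label} <box>" + " ".join(str(c) for c in coords) + "</box>"


def build_sample(prompt: str, boxes: list[Box], width: int, height: int) -> str:
    """Interleave the image placeholder with the prompt, then append the box targets."""
    target = "; ".join(box_to_text(b, width, height) for b in boxes)
    return f"{IMAGE_TOKEN} {prompt}\n{target}"


sample = build_sample(
    "Where is the dog?",
    [Box("dog", 64, 48, 320, 240)],
    width=640,
    height=480,
)
print(sample)
```

The idea is only that the coordinates end up in the token stream the model is trained on, so it has a chance to learn to emit them; the exact format would depend on the model and tokenizer you are using.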
Interleaving Tokens and Training Data for Vision Models