That makes sense. But you could also use a decoder-only architecture, with embedded image tokens included directly in the input sequence, as in LLaMA-Adapter, for example. (*This still uses an encoder to produce the image tokens, but the language model itself remains decoder-only, since there is no cross-attention.)
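To make the pattern concrete, here is a minimal PyTorch sketch of that idea (not LLaMA-Adapter's actual implementation): features from a separate image encoder are linearly projected into the token-embedding space and prepended to the text embeddings, and the model then runs plain causal self-attention over the combined sequence, with no cross-attention module anywhere. All module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderOnlyVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=4, img_feat_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Projects precomputed image features (e.g. from a frozen ViT)
        # into the same space as the text token embeddings.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        # Self-attention-only blocks plus a causal mask act as a
        # decoder-only transformer: there are no cross-attention layers.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, text_ids):
        # img_feats: (B, n_img_tokens, img_feat_dim); text_ids: (B, T)
        img_tok = self.img_proj(img_feats)        # (B, I, d_model)
        txt_tok = self.tok_emb(text_ids)          # (B, T, d_model)
        x = torch.cat([img_tok, txt_tok], dim=1)  # one flat sequence
        # One causal mask over the full (image + text) sequence.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(x, mask=mask)
        # Only the text positions produce next-token predictions.
        return self.lm_head(h[:, img_tok.size(1):])

model = DecoderOnlyVLM()
img_feats = torch.randn(2, 16, 768)          # stand-in for ViT patch features
text_ids = torch.randint(0, 32000, (2, 10))
logits = model(img_feats, text_ids)          # (2, 10, vocab_size)
print(logits.shape)
```

Because the image tokens sit in the same sequence as the text tokens, the text can attend to them through ordinary self-attention, which is exactly why no dedicated cross-attention block is needed. (Real systems differ in details, e.g. whether the image prefix attends bidirectionally to itself; this sketch just applies one causal mask throughout.)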