AI Dynamics

Global AI News Aggregator

Decoder-Only Architectures for Vision-Language Models

That makes sense. But you could also use a decoder-only architecture, with embedded image tokens included as part of the input sequence, as in LLaMA-Adapter, for example. (*The image tokens still come from an image encoder, but the model counts as decoder-only because it uses no cross-attention.)
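The idea can be sketched in a few lines: image patch embeddings are simply prepended to the text token embeddings, and the combined sequence passes through ordinary causal self-attention. This is an illustrative toy, not LLaMA-Adapter's actual code; the helper names and the plain causal mask are assumptions for clarity.

```python
# Toy sketch of a decoder-only vision-language input path (hypothetical
# helpers, not LLaMA-Adapter's real implementation): image tokens are
# just prefix tokens in one self-attended sequence -- no cross-attention.

def build_input(image_embeds, text_embeds):
    """Concatenate image-token and text-token embeddings into one sequence."""
    return image_embeds + text_embeds

def causal_mask(n_image_tokens, n_text_tokens):
    """mask[i][j] is True when position i may attend to position j."""
    n = n_image_tokens + n_text_tokens
    return [[j <= i for j in range(n)] for i in range(n)]

# Toy 2-d embeddings: 3 image patches (from some image encoder),
# followed by 2 text tokens.
img = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
txt = [[1.0, 0.0], [0.0, 1.0]]

seq = build_input(img, txt)          # length 5: images first, then text
mask = causal_mask(len(img), len(txt))

# The first text token (index 3) sees every image token (0..2) through
# plain causal self-attention, so no separate cross-attention is needed.
assert all(mask[3][j] for j in range(3))
```

Because the image tokens sit earlier in the sequence, every text position can attend to all of them under the standard causal mask, which is exactly why cross-attention layers are unnecessary in this design.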

→ View original post on X — @rasbt
