Extending LLMs to Vision: Incremental Multimodal Integration with Flamingo

Extending LLMs from text to vision will probably take time but, interestingly, it can be done incrementally. For example, Flamingo (https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf, pdf) processes both modalities simultaneously in one LLM.
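
To make the "incremental" point concrete, here is a toy sketch of the tanh-gated cross-attention idea described in the Flamingo paper: new layers let text features attend to visual features, and a gate that starts at zero leaves the language model's output unchanged at the start of training. This is a simplified illustration in plain Python, not Flamingo's actual implementation. It has no learned projections, a single head, and a single text vector. The function names (cross_attend, gated_block) and the example values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_vec, visual_vecs):
    """Toy single-head attention: one text vector attends over visual vectors.

    Assumption: no learned query/key/value projections, unlike a real model.
    """
    scale = math.sqrt(len(text_vec))
    scores = [
        sum(t * v for t, v in zip(text_vec, vis)) / scale
        for vis in visual_vecs
    ]
    weights = softmax(scores)
    dim = len(text_vec)
    # Weighted average of the visual vectors.
    return [
        sum(w * vis[i] for w, vis in zip(weights, visual_vecs))
        for i in range(dim)
    ]


def gated_block(text_vec, visual_vecs, alpha):
    """Residual update: text + tanh(alpha) * cross_attention(text, visual).

    With alpha = 0 the gate is 0, so the text features pass through unchanged.
    """
    gate = math.tanh(alpha)
    attended = cross_attend(text_vec, visual_vecs)
    return [t + gate * a for t, a in zip(text_vec, attended)]


# Hypothetical toy features (illustrative values only).
text = [0.5, -1.0, 2.0]
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

unchanged = gated_block(text, visual, alpha=0.0)  # gate closed
mixed = gated_block(text, visual, alpha=1.0)      # gate partly open

print(unchanged == text)  # True: the text-only behaviour is preserved
print(mixed != text)      # True: visual information is now mixed in
```

The design choice worth noting is the gate: because the visual pathway contributes nothing at alpha = 0, vision can be added to an existing text-only model without disturbing what it already does, and the contribution grows only as the gate parameter is trained.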