SOTA Multimodal Models Without ViT Vision Encoders
I've been wondering lately whether there is any SOTA multimodal model that does not use a ViT as its vision encoder.