Also weird: there’s no obvious transfer learning back to text. For all the ineffable, AGI-essential knowledge supposedly in images, multimodal models seem no better at spatial-reasoning word problems, creating SVGs, designing web UIs, or drawing ASCII art.