Multimodal machine learning is a hot area in AI research. Unimodal learning has developed massively in the last 5 years. The challenge now is how we fuse different modalities(vision, audio, text, robot actions) into a single agent. GPT-4 & similar models are the beginning.
Multimodal Machine Learning: Fusing Vision, Audio, Text, and Actions
By
–
Leave a Reply