Most Unified Multimodal Models still generate images through a frozen VAE, which means perception and generation are not fully learned in one model. This paper fixes this by making the decoder first predict
Representation Forcing for Bottleneck-Free Unified Multimodal Models
By
–
