"Let ViT Speak: Generative Language-Image Pre-training" Instead of contrastive image-text matching or adding a separate text decoder, this paper, GenLIP, trains a ViT to directly predict caption tokens from image patches with a standard next-token loss. A key component they use
GenLIP Trains ViT to Directly Predict Caption Tokens
By
–
