Can vision-language models truly see the fine-grained details in images? Google DeepMind presents TIPSv2. They boost dense patch-text alignment using three novel tricks: a distillation method where the student outperforms the teacher, an upgraded masked image objective
Google DeepMind TIPSv2 Boosts Vision-Language Dense Patch-Text Alignment
By
–
