In this example, we believe CLIP sees the "surprised cat pose" and predicts doubt, surprise, and fear, ignoring the context. Oddly, LLaVA also goes a bit too far, inferring that the person in this image is sad because it is their last ski trip of the season.
CLIP and LLaVA Struggle with Contextual Image Interpretation