Some interesting examples: in this image and a few others, CLIP appears to associate bare skin with "embarrassment". LLaVA and Captions + GPT do not; they seem to reason over the location and context instead.
CLIP Vision Model Bias: Skin Recognition and Context Understanding