I know it’s popular to hate tokenizers, but visual representations (which are also tokenized) bring plenty of messiness of their own: aspect ratios, cropping, resolution, brightness, and so on. Sure, models learn to deal with that, but it takes a lot of data to make them robust to these variations.
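To make the aspect-ratio and resolution point concrete, here is a minimal sketch, assuming a ViT-style tokenizer that cuts the image into non-overlapping 16-pixel patches. The patch size, the pad-to-multiple behavior, and the function names `patch_tokens` and `center_crop_kept_fraction` are illustrative assumptions, not something the post specifies.

```python
import math


def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Count patch tokens, assuming the image is padded up to a multiple of `patch`.

    Hypothetical helper: real pipelines may resize or crop instead of padding.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return rows * cols


def center_crop_kept_fraction(width: int, height: int) -> float:
    """Fraction of the original pixels that survive a square center crop."""
    side = min(width, height)
    return (side * side) / (width * height)


if __name__ == "__main__":
    # Same patch size, very different sequence lengths and information loss.
    for w, h in [(224, 224), (640, 480), (1920, 1080)]:
        print(
            f"{w}x{h}: {patch_tokens(w, h)} tokens, "
            f"{center_crop_kept_fraction(w, h):.2f} of pixels kept by a square crop"
        )
```

Under these assumptions, the token count and the amount of content discarded both depend on preprocessing choices made before the model sees anything, which is the kind of variation a model has to learn to be robust to.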
Visual Tokenization Challenges: Aspect Ratios and Image Preprocessing Complexity