That's a great question, and I'll take the opportunity to say a bit about the data mixing. Mixing data is a tricky balance, it turns out. There were two main factors at play:
– we wanted to keep general vision-language skills,
– and we had unbalanced regions and languages: think
Data Mixing Balance: Vision-Language Skills and Regional Imbalance