Humans can see in high-res, high-FPS in real-time. Why can't VLMs?
Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos.
Paper: arxiv.org/abs/2603.12254
Project page: autogaze.github.io
Hugging Face: huggingface.co/collections/b…
pic.twitter.com/O0A0WLrkxb (1/n)
- Baifeng (@baifeng_shi), March 24, 2026
View original post on X (@berkeley_ai, 2026-03-24 18:46 UTC)