Interestingly, the native and most general I/O medium of existing infrastructure is the screen plus keyboard, mouse, and touch. But pixels are still computationally intractable for now, relatively speaking, so it's faster to adapt (textify/compress) the most useful screens so that LLMs can act over them.
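To get a rough feel for the gap, here is a back-of-the-envelope sketch in Python comparing the raw size of one uncompressed screenshot with a textified version of the same screen. The resolution (1920x1080 RGB), the 50-line by 80-character text layout, and the helper names are all assumptions made for illustration. Raw byte count is only a crude proxy for what a model has to process, not a measure of actual compute.

```python
def raw_pixel_bytes(width: int, height: int, channels: int = 3) -> int:
    """Bytes in one uncompressed frame with one byte per channel."""
    return width * height * channels


def text_bytes(text: str) -> int:
    """Bytes needed to store the text as UTF-8."""
    return len(text.encode("utf-8"))


# Assumed screen: a 1920x1080 RGB screenshot.
screen = raw_pixel_bytes(1920, 1080)

# Assumed textified form of the same screen: 50 lines of 80 characters.
textified = "\n".join("x" * 80 for _ in range(50))
text = text_bytes(textified)

# How many times larger the raw frame is than its text stand-in.
ratio = screen / text

print(f"raw pixels: {screen:,} bytes")
print(f"textified:  {text:,} bytes")
print(f"ratio:      about {ratio:,.0f}x")
```

Under these assumptions the raw frame is more than a thousand times larger than the text, which is the intuition behind compressing screens into text before handing them to an LLM.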
Why LLMs Process Text Instead of Raw Pixels