3/ Inputs and outputs span text, image, video, audio AND action. That last one is the big deal. Cosmos 3 was trained natively to generate actions, so the same checkpoint can run as a vision-language model, a video world model, or a robot policy. No multi-model orchestration. pic.twitter.com/PoLsK33ytZ
Chubby♨️ (@kimmonismus), June 1, 2026