Using temporary helper images as multimodal input for generation
Internally, it appears to create temporary "helper images" (presumably by running code) and then to pass those outputs as multimodal input to the final generation step.
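To make the suspected two-stage flow concrete, here is a minimal sketch of what such a pipeline could look like. It is only an illustration under stated assumptions, not the system's actual implementation: `make_helper_png`, `build_multimodal_request`, and the payload layout are all hypothetical names and shapes chosen for this example. The first stage renders small helper images with code, and the second packs them alongside the text prompt into a single multimodal request.

```python
import base64
import struct
import zlib


def make_helper_png(width: int, height: int, rgb: tuple) -> bytes:
    """Encode a solid-color RGB image as PNG using only the standard library.

    Hypothetical stand-in for whatever code produces the helper images.
    """

    def chunk(tag: bytes, data: bytes) -> bytes:
        # A PNG chunk is: data length, tag, data, CRC32 over tag + data.
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # Every scanline starts with filter byte 0 (no filtering).
    row = b"\x00" + bytes(rgb) * width
    raw = row * height
    # 8-bit depth, color type 2 (RGB), default compression/filter, no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", header)
        + chunk(b"IDAT", zlib.compress(raw))
        + chunk(b"IEND", b"")
    )


def build_multimodal_request(prompt: str, helper_images: list) -> dict:
    """Assumed payload shape: one text part followed by base64 image parts."""
    parts = [{"type": "text", "text": prompt}]
    for png in helper_images:
        parts.append({
            "type": "image",
            "media_type": "image/png",
            "data": base64.b64encode(png).decode("ascii"),
        })
    return {"parts": parts}


# Stage 1: generate temporary helper images in code.
helpers = [make_helper_png(4, 4, (255, 0, 0)), make_helper_png(4, 4, (0, 0, 255))]
# Stage 2: hand them to the final generation step as multimodal input.
request = build_multimodal_request("Compose the final image from these guides.", helpers)
```

In this sketch the helper images never touch disk, which matches the idea that they are temporary: they exist only long enough to be encoded into the request for the final generation.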