Thanks. What I had in mind is this: the control is, e.g., the text prompt. One then simulates forward with the video model and verifies whether the desired goal was attained, using e.g. VLM success detectors (https://arxiv.org/abs/2303.07280). The action text is an intermediate signal generated by a
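To make the prompt-as-control loop concrete, here is a minimal sketch of how I picture it. Everything in it is an assumption for illustration: `control_by_prompt`, `toy_simulate`, and `toy_detector` are hypothetical names, the "video model" is a stub that returns text frames, and the "success detector" is a substring check standing in for a real VLM. It is not the method of the linked paper.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in: a "video" is just a list of frame descriptions.
Video = list


def control_by_prompt(
    candidate_prompts: Sequence[str],
    simulate: Callable[[str], Video],
    detect_success: Callable[[Video, str], bool],
    goal: str,
) -> Optional[str]:
    """Return the first prompt whose simulated rollout reaches the goal."""
    for prompt in candidate_prompts:
        video = simulate(prompt)          # roll the video model forward
        if detect_success(video, goal):   # ask the detector about the goal
            return prompt
    return None  # no candidate prompt attained the goal


def toy_simulate(prompt: str) -> Video:
    """Toy video model (assumption): every frame just echoes the prompt."""
    return [f"frame showing: {prompt}"] * 3


def toy_detector(video: Video, goal: str) -> bool:
    """Toy success detector (assumption): goal text appears in the last frame."""
    return goal in video[-1]


best = control_by_prompt(
    ["push the cup", "pick up the cup"], toy_simulate, toy_detector, "pick up"
)
print(best)
```

In a real setup the two callables would wrap an actual video model and a VLM-based detector; the search over prompts could also be replaced by any optimizer, since the loop only needs a success signal per rollout.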
Video Model Control Using Text Prompts and VLM Success Detection