Maybe it's because video data contains a moderate amount of intermediate reasoning steps. In a natural, unedited video, everything is raw and live, and corrections are versioned, while text on the web tends to be the final version people publish.
Video Data Contains More Intermediate Reasoning Than Published Text