You're looking for complexity where there is none. "AI" does not go out into the wild to scrape data; programmers write the code that does it, and so they can & should track the source and copyright of everything it collects. If an item has no clear & usable copyright information, then the code drops it…
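To make that concrete, here is a minimal sketch of what "track the source and drop anything without usable copyright info" could look like. Everything in it is hypothetical and only for illustration: the `ScrapedItem` record, the `keep_usable` function, and the `USABLE_LICENSES` allow-list are made-up names, and which licenses actually count as usable is a policy/legal decision, not something this code settles.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical allow-list of license identifiers considered usable.
# The real list is a policy decision; these entries are placeholders.
USABLE_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT"}


@dataclass(frozen=True)
class ScrapedItem:
    """One scraped document plus the provenance recorded at scrape time."""
    source_url: str
    text: str
    license_id: Optional[str] = None  # None means no copyright info was found


def keep_usable(items: Iterable[ScrapedItem]) -> list[ScrapedItem]:
    """Keep only items that have a source and a license on the allow-list."""
    return [
        item
        for item in items
        if item.source_url and item.license_id in USABLE_LICENSES
    ]


if __name__ == "__main__":
    scraped = [
        ScrapedItem("https://example.com/a", "some text", "CC0-1.0"),
        ScrapedItem("https://example.com/b", "no license found"),
        ScrapedItem("https://example.com/c", "unclear terms", "All rights reserved"),
    ]
    for item in keep_usable(scraped):
        print(item.source_url, item.license_id)
```

The point is just that provenance travels with each record from the moment it is scraped, so filtering on it later is a one-line check rather than a research problem.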