Good point. Btw, you also see a similar leakage trap in recommendation datasets. Most recsys data is built around users interacting with items over time, which means the same user appears dozens or hundreds of times. If you do a plain random split, you accidentally leak the
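To make the trap concrete, here is a minimal sketch in plain Python. The interaction log is made-up toy data, and the helper names (`random_split`, `split_by_user`) are hypothetical, not from any library. It only counts how many users end up on both sides of each split. Holding out whole users is one possible alternative; a time-based cutoff is another, and which one fits depends on what the model is supposed to predict.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy log: 20 users x 30 interactions each,
# stored as (user_id, item_id, timestamp). Not real data.
interactions = [
    (f"user_{u}", f"item_{rng.randrange(100)}", t)
    for u in range(20)
    for t in range(30)
]

def users(rows):
    """Set of user ids that appear in a list of interaction rows."""
    return {user for user, _item, _ts in rows}

def random_split(rows, test_frac, rng):
    """Plain row-level random split: shuffle rows, cut at the train fraction."""
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def split_by_user(rows, test_frac, rng):
    """Group-aware split: each user lands entirely in train or entirely in test."""
    ids = sorted(users(rows))
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    test_ids = set(ids[:n_test])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test

rand_train, rand_test = random_split(interactions, 0.2, rng)
user_train, user_test = split_by_user(interactions, 0.2, rng)

# Users whose interactions show up in both train and test.
random_overlap = users(rand_train) & users(rand_test)
grouped_overlap = users(user_train) & users(user_test)

print(len(random_overlap), "users on both sides of the random split")
print(len(grouped_overlap), "users on both sides of the per-user split")
```

With the row-level split, the shared-user count is well above zero because every user has many rows to scatter; with the per-user split it is zero by construction.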
Data Leakage in Recommendation Systems from User Interaction Patterns