Thinking about this a bit more, I think nearest-neighbor on normalized bag-of-words vectors would probably also perform well here, because the count vectors of two similar documents are similar, just as the individually compressed documents and their compressed concatenation are similar.
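A rough sketch of what I have in mind, not a tested baseline: the function names, the whitespace tokenization, the L2 normalization, and the use of cosine similarity as the nearest-neighbor criterion are all my own assumptions for illustration.

```python
import math
from collections import Counter


def bow_vector(text):
    """Return an L2-normalized bag-of-words vector as a {word: weight} dict.

    Assumption: lowercased whitespace tokenization is good enough here.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


def nearest_label(query, train):
    """Return the label of the training document most similar to the query.

    `train` is a list of (text, label) pairs; this is plain 1-nearest-neighbor.
    """
    q = bow_vector(query)
    _, label = max(train, key=lambda pair: cosine(q, bow_vector(pair[0])))
    return label


if __name__ == "__main__":
    train = [
        ("the cat sat on the mat", "animals"),
        ("stocks fell as markets closed", "finance"),
    ]
    print(nearest_label("a cat on a mat", train))
```

Normalizing the count vectors keeps document length from dominating the comparison, which is roughly the role the normalization term plays in the compression-based distance.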
Nearest-Neighbor Approach on Normalized Bag-of-Words Vectors