Thinking about this a bit more, I think nearest-neighbor on normalized bag-of-words vectors would probably also perform well here, because the word-count vectors of two similar documents are similar, just as their individually compressed and concatenated-then-compressed forms are.
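To make the idea concrete, here is a rough sketch of what I have in mind, not a tuned baseline. It assumes a naive lowercase alphanumeric tokenizer, L2 normalization, cosine similarity, and 1-nearest-neighbor; the function names (`bow_vector`, `cosine`, `nearest_label`) and the toy documents are made up for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Return an L2-normalized bag-of-words vector as a dict word -> weight."""
    # Assumption: a very naive tokenizer (lowercase runs of letters/digits).
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


def nearest_label(query, labeled_docs):
    """1-nearest-neighbor: return the label of the most similar document."""
    q = bow_vector(query)
    best = max(labeled_docs, key=lambda item: cosine(q, bow_vector(item[0])))
    return best[1]


# Toy data, invented purely for illustration.
train = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored a goal in the final match", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stocks fell as the bank cut its interest forecast", "finance"),
]

print(nearest_label("a late goal decided the match", train))  # sports
```

A real comparison would want proper train/test splits and probably k > 1, but the point is only that shared vocabulary drives the similarity here, much like shared substrings drive it in the compression approach.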