
How to Use MinHash and MinHashLSH to Identify Comprehensive Documents and Partial Matches? #216

Open
aplmikex opened this issue Aug 16, 2023 · 3 comments

@aplmikex

Suppose I have one document, and another document that contains a portion of it plus a small amount of extra text, such as advertisements. Can MinHash be used to keep only the most comprehensive document? And if such partial documents are spread across a large number of files, can MinHashLSH identify them as accurately as possible?

@ekzhu
Owner

ekzhu commented Aug 17, 2023

Partial document overlap detection using MinHash is tricky. Effectively you are detecting containment rather than Jaccard similarity: Containment(A, B) = |A ∩ B| / |A|. What you can do is either

  1. use a larger MinHash (e.g., num_perm > 256) to allow the overlap signal to show up in the Jaccard estimate, and then convert the estimated Jaccard similarity to containment using the inclusion-exclusion principle (see the sketch after this list).
  2. chunk the documents into smaller chunks and build a MinHash for every chunk. By comparing all pairs of MinHashes from the two documents, hopefully there will be a pair that gives you a high Jaccard similarity signal. So this is effectively looking for max(Jaccard(m_i, m_j)) where i ∈ doc_1, j ∈ doc_2.
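
Here is a rough sketch of option 1 using datasketch; the helper name `minhash_containment` and the toy documents are just for illustration. It estimates the Jaccard similarity and the two set sizes from the MinHashes, then applies inclusion-exclusion to recover the intersection size and the containment of A in B:

```python
from datasketch import MinHash

def minhash_containment(m_a, m_b):
    """Estimate Containment(A, B) = |A ∩ B| / |A| from two MinHashes."""
    j = m_a.jaccard(m_b)   # estimated Jaccard similarity
    size_a = m_a.count()   # estimated |A|
    size_b = m_b.count()   # estimated |B|
    # Inclusion-exclusion: J = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    # => |A ∩ B| = J * (|A| + |B|) / (1 + J)
    intersection = j * (size_a + size_b) / (1.0 + j)
    return intersection / size_a

# Toy example: doc_b contains all of doc_a plus a short "advertisement".
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = doc_a + "buy one get one free limited time offer".split()

m_a, m_b = MinHash(num_perm=512), MinHash(num_perm=512)
for token in doc_a:
    m_a.update(token.encode("utf8"))
for token in doc_b:
    m_b.update(token.encode("utf8"))

print(minhash_containment(m_a, m_b))  # should be close to 1.0
```

With a small num_perm the Jaccard and cardinality estimates are noisy, which is why a larger MinHash helps for this conversion.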

@ekzhu ekzhu added the question label Aug 17, 2023
@aplmikex
Author

Thank you very much for your response; your insights have been very helpful. I had indeed considered your second suggestion, but segmenting the data doesn't guarantee that corresponding sections of the two documents line up exactly. How do you see that issue? Pairwise comparison itself is straightforward, and there are alternative ways to do it, but with my dataset reaching a hundred thousand entries or more, the O(n^2) complexity makes it far too slow. I'm feeling a bit stuck and would greatly appreciate your guidance and suggestions.

@ekzhu
Owner

ekzhu commented Aug 30, 2023

LSH exists precisely to avoid the O(n^2) pairwise comparison. You can use MinHash LSH to index the MinHash of every segment, and then self-query the index using each segment's MinHash. This approximates the pairwise comparison you mentioned without the quadratic cost.

You can have some overlap between adjacent segments. Have you tried that?
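
Roughly something like the sketch below; the segment length, overlap, threshold, and the `corpus` dict are placeholders you would tune for and adapt to your data:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
SEG_LEN = 200   # tokens per segment (placeholder)
OVERLAP = 50    # tokens shared between adjacent segments (placeholder)

def segment_minhashes(doc_id, tokens):
    """Yield (key, MinHash) for overlapping token segments of one document."""
    step = SEG_LEN - OVERLAP
    for start in range(0, max(len(tokens) - OVERLAP, 1), step):
        m = MinHash(num_perm=NUM_PERM)
        for token in tokens[start:start + SEG_LEN]:
            m.update(token.encode("utf8"))
        yield f"{doc_id}:{start}", m

# Index every segment of every document once.
# corpus is assumed to be a dict of {doc_id: text}.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
segments = {}
for doc_id, text in corpus.items():
    for key, m in segment_minhashes(doc_id, text.split()):
        lsh.insert(key, m)
        segments[key] = m

# Self-query: for each segment, retrieve candidate matching segments
# from other documents, without any O(n^2) pairwise comparison.
for key, m in segments.items():
    doc_id = key.split(":")[0]
    matches = [k for k in lsh.query(m) if not k.startswith(doc_id + ":")]
    if matches:
        print(key, "overlaps with", matches)
```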
