Can we explore refinements to the MinHash implementation? #1230
Comments
hi! thank you very much for posting these links! I haven't had a chance to read them thoroughly yet, but I wanted to drop a note in here to say that we are using A quick skim of the second paper suggests that this approach might also work for Scaled MinHash, but that's an uninformed opinion at this point :) This is also a topic that @dkoslicki might be interested in!
I recently added a HyperLogLog to sourmash using the maximum-likelihood estimators from "New cardinality estimation algorithms for HyperLogLog sketches", which also use the counts for estimating overlap (the "Joint MLE" in the paper). In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?
(I need to read both papers more deeply, but so cool to see similar problems in other fields =])
I'm glad to hear there is openness on this. I do think both refinements I suggested are very easy to implement. Once you get a chance to read either paper more closely, let me know.
Not exactly. The suggestion is to use eq. 12, 16 and 25 from this paper: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf. These formulas correct for over-saturation of the bloom vector. It should give you much more accurate estimates of the total overlap.
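To make the idea of a saturation correction concrete, here is a minimal sketch. It assumes one hash function per item and uses the standard occupancy estimate n ≈ -m ln(1 - x/m) for an m-bit vector with x bits set, then gets the overlap by inclusion-exclusion. It is not claimed to reproduce eq. 12, 16 and 25 of the linked paper term for term, and `to_bits`, `estimate_cardinality` and `corrected_overlap` are hypothetical names for illustration, not part of sourmash.

```python
import hashlib
import math


def to_bits(items, m):
    """Hypothetical helper: hash each item to one of m bit positions.

    The m-bit vector is returned as a Python int. Collisions between
    items are what cause the saturation being corrected for.
    """
    bits = 0
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        bits |= 1 << (int.from_bytes(digest, "big") % m)
    return bits


def estimate_cardinality(bits_set, m):
    """Estimate how many distinct items produced bits_set on-bits out of m.

    Assumes one hash per item; uses n ~= -m * ln(1 - bits_set / m).
    """
    if not 0 <= bits_set < m:
        raise ValueError("bits_set must be in [0, m); a full vector is uninformative")
    return -m * math.log(1.0 - bits_set / m)


def corrected_overlap(a, b, m):
    """Return (intersection, union, jaccard) estimates for two m-bit vectors."""
    n_a = estimate_cardinality(bin(a).count("1"), m)
    n_b = estimate_cardinality(bin(b).count("1"), m)
    n_union = estimate_cardinality(bin(a | b).count("1"), m)
    # Inclusion-exclusion on the corrected cardinalities; clamp at zero
    # because sampling noise can push the difference slightly negative.
    n_inter = max(0.0, n_a + n_b - n_union)
    jaccard = n_inter / n_union if n_union > 0 else 0.0
    return n_inter, n_union, jaccard
```

For example, `corrected_overlap(to_bits(range(600), 4096), to_bits(range(300, 900), 4096), 4096)` estimates the overlap of two sets whose true intersection is 300 and true union is 900. The estimates are noisy, and the noise grows as the vector fills up.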
Let me know if you need any information on this. It should be very easy to implement.
There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.
Count vectors can be used to estimate overlap between sets more efficiently: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/
There is a fairly straightforward correction that can be applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf
I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.
CTB note: updated links