Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we explore refinements to the MinHash implementation? #1230

Open
swamidass opened this issue Nov 2, 2020 · 5 comments
Open

Can we explore refinements to the MinHash implementation? #1230

swamidass opened this issue Nov 2, 2020 · 5 comments

Comments

@swamidass
Copy link
Contributor

swamidass commented Nov 2, 2020

There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.

  1. Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/

  2. There is a fairly straight-forward correction that can applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf

I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

CTB note: updated links

https://pubs.acs.org/doi/abs/10.1021/ci600526a
https://pubmed.ncbi.nlm.nih.gov/25714898/

@ctb
Copy link
Contributor

ctb commented Nov 2, 2020

hi! thank you very much for posting these links!

I haven't had a chance to read them thoroughly yet, but I wanted to drop a note in here to say that we are using
a non-standard MinHash for most of our sourmash work - Scaled MinHashes. This is discussed at length in @luizirber PhD thesis, and is also the subject of a (still draft) paper.

A quick skim of the second paper suggests that this approach might also work for Scaled MinHash, but that's an uninformed opinion at this point :)

This is also a topic that @dkoslicki might be interested in!

@luizirber
Copy link
Member

There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.

1. Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/

I recently added a HyperLogLog to sourmash using the maximum-likelihood estimators from "New cardinality estimation algorithms for HyperLogLog sketches", which are also using the counts for estimating overlap (the "Joint MLE" in the paper).

In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

2. There is a fairly straight-forward correction that can applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf

I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

(I need to read both papers more deeply, but so cool to see similar problems in other fields =])

@swamidass
Copy link
Contributor Author

I'm glad to hear there is openness on this. I do think both I suggested are very easy to implement. Once you get a chance to read either paper more closely let me know.

@swamidass
Copy link
Contributor Author

In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

Not exactly. The suggestion is to use eq. 12, 16 and 25 from this paper: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf.

These formulas correct for over-saturation of the bloom vector. It should give you much more accurate estimates of the total overlap.

@ctb ctb changed the title Can we explore refinements to the signatures? Can we explore refinements to the MinHash implementation? Mar 4, 2021
@swamidass
Copy link
Contributor Author

Let me know if you need any information on this. It should be very easy to implement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants