Can we explore refinements to the MinHash implementation? #1230

swamidass · 2020-11-02T03:41:45Z

There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.

Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/
There is a fairly straightforward correction that can be applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf

I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

CTB note: updated links

https://pubs.acs.org/doi/abs/10.1021/ci600526a
https://pubmed.ncbi.nlm.nih.gov/25714898/

ctb · 2020-11-02T14:10:51Z

hi! thank you very much for posting these links!

I haven't had a chance to read them thoroughly yet, but I wanted to drop a note in here to say that we are using a non-standard MinHash for most of our sourmash work - Scaled MinHashes. This is discussed at length in @luizirber's PhD thesis, and is also the subject of a (still draft) paper.

A quick skim of the second paper suggests that this approach might also work for Scaled MinHash, but that's an uninformed opinion at this point :)
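For readers unfamiliar with the idea, here is a minimal sketch of a Scaled MinHash: instead of keeping a fixed number of smallest hashes, it keeps every hash that falls below a threshold of `2**64 / scaled`. The function names and the use of `hashlib.blake2b` are assumptions made for illustration; sourmash itself uses a different hash function and canonical k-mers, neither of which is reproduced here.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # size of the 64-bit hash space


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (illustrative; not sourmash's hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(sequence: str, ksize: int, scaled: int) -> set:
    """Keep every k-mer hash below MAX_HASH_SPACE // scaled."""
    threshold = MAX_HASH_SPACE // scaled
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = hash_kmer(sequence[i:i + ksize])
        if h < threshold:
            hashes.add(h)
    return hashes


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sketches built with the same `scaled`."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Fraction of sketch `a` that is also present in sketch `b`."""
    return len(a & b) / len(a) if a else 0.0
```

Because the threshold is the same for every dataset, two sketches built with the same `scaled` value can be compared directly, and the sketch size grows with the number of distinct k-mers rather than being fixed.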

This is also a topic that @dkoslicki might be interested in!

luizirber · 2020-11-03T16:26:39Z

> There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.
> 1. Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/

I recently added a HyperLogLog to sourmash using the maximum-likelihood estimators from "New cardinality estimation algorithms for HyperLogLog sketches", which are also using the counts for estimating overlap (the "Joint MLE" in the paper).

In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

> 2. There is a fairly straightforward correction that can be applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf
> I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

(I need to read both papers more deeply, but so cool to see similar problems in other fields =])

swamidass · 2020-11-07T18:00:49Z

I'm glad to hear there is openness on this. I do think both refinements I suggested are very easy to implement. Once you get a chance to read either paper more closely, let me know.

swamidass · 2020-11-07T18:05:02Z

> In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

Not exactly. The suggestion is to use eq. 12, 16 and 25 from this paper: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf.

These formulas correct for over-saturation of the Bloom vector. They should give you much more accurate estimates of the total overlap.
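To make the idea of a saturation correction concrete, here is a minimal sketch. It assumes the commonly cited estimate for the number of items hashed into a bit vector of `m` bits with `k` hash functions and `X` bits set, `n ≈ -(m / k) * ln(1 - X / m)`, combined with inclusion-exclusion on the bitwise OR of two vectors. It is not a transcription of eq. 12, 16, or 25 from the paper, and the function names are hypothetical.

```python
import math


def estimate_items(bits_set: int, num_bits: int, num_hashes: int = 1) -> float:
    """Estimate how many items were hashed into a bit vector.

    Assumes the estimate n ~ -(m / k) * ln(1 - X / m); this is an
    illustrative assumption, not a formula copied from the thread.
    """
    if bits_set >= num_bits:
        raise ValueError("bit vector is fully saturated; estimate is unbounded")
    return -(num_bits / num_hashes) * math.log(1.0 - bits_set / num_bits)


def estimate_overlap(a: int, b: int, num_bits: int, num_hashes: int = 1) -> float:
    """Estimate the overlap of two sets from their bit vectors.

    `a` and `b` are bit vectors stored as Python ints. The overlap is
    obtained by inclusion-exclusion: |A| + |B| - |A union B|, where the
    union's bit vector is the bitwise OR of the two inputs.
    """
    n_a = estimate_items(bin(a).count("1"), num_bits, num_hashes)
    n_b = estimate_items(bin(b).count("1"), num_bits, num_hashes)
    n_union = estimate_items(bin(a | b).count("1"), num_bits, num_hashes)
    return n_a + n_b - n_union
```

Counting set bits directly underestimates set sizes once collisions become common; the logarithm compensates for that. Because each term is an estimate, the result can come out slightly negative for disjoint sets.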

swamidass · 2021-03-08T21:12:30Z

Let me know if you need any information on this. It should be very easy to implement.

ctb mentioned this issue Mar 3, 2021

Update author list for 4.0 #622

Closed

ctb changed the title ~~Can we explore refinements to the signatures?~~ Can we explore refinements to the MinHash implementation? Mar 4, 2021
