-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FlowerFilter monoid #135
Comments
So if you need 10 hash functions over an array with capacity to store 100 elements:
To actually use the 10 hash functions on a key, say key = 34 The family of hashfunction algo : From pg 267, Corman & Leiserson ( Designing a universal class of hash functions ) |
Edit: p has to be prime, btw :) |
We've already done this three times in the code: min hash, count min sketch, and bloom filter. Here's one more approach: http://stackoverflow.com/questions/9553989/what-hashing-techniques-to-use-when-building-a-bloom-filter-in-clojure I don't think we need a new hash family. In fact, I'd argue we should only use one approach in this code. |
Agreed, we should have a HashFamily that gets used by all of these. BTW this is a variation of what @jmhodges calls "The opposite of a bloom filter", right? |
In that case, the 'val hashFunctions' implementation in min hash should suffice for my purposes. I will suggest benchmarking... the Leiserson algo is very,very cheap - generate a prime number once and integral mod twice. Whereas the implementation you have uses a random number gen plus other heavy machinery - maybe more robust but also more expensive. |
Do you feel aFlowerFilter.contains(item) should return a Boolean or an ApproximateBoolean like BloomFilter? Although it goes against the precedent, I believe it might be better to return a Boolean because:
Feedback appreciated. |
David, if we could get a correct bound on the Boolean, then we should do so. So, when it says true, the probabilty it is correct is 1. But when we say absent, what is the probability that another item has evicted that item? Can we derive such a bound? |
Yes sir, a slight modification to http://eng.42go.com/flower-filter-an-update/ for fingerprinting collisions yields the true probability. Two other things that I missed:
|
http://eng.42go.com/a-simple-time-decaying-approximate-membership-filter/
This stores each item T in each of it's hash bucket indices (so it is considerably larger than a normal bloom filter), but it considers something present if t: T exists in one of the hashed locations.
This is interesting because it can only remember recent items, which is often useful for some streaming join, or some decayed approximation algorithm.
The text was updated successfully, but these errors were encountered: