Question about implementation of double hashing #43
Is your question more about the randomness property of having "random, uniform, and independent hash functions h1 and h2" before computing the double hashing? Or more about the fact that we're rehashing?
This. (2*hashCount, actually). My understanding is that double hashing lets you simulate k independent hash functions gᵢ(x) (1 ≤ i ≤ k) of some input x (to an output set of size m) with only two actual hash calculations, by taking gᵢ(x) = h₁(x) + i·h₂(x) mod m (maybe plus a quadratic or cubic term as well). For example, you could simulate 10 independent hashes with only two actual hash calculations instead of ten. But in your implementation, you're simulating 10 independent hashes with 20 actual hash calculations. You're doing twice the work of an implementation that didn't use double hashing at all and just calculated one hash per index. I did an implementation (as a fork of bloomit, which is a fork of bloom-filters) that just used two hash calculations and used double hashing to simulate all k, and it seems to work as expected: https://github.com/ably-forks/bloomit/blob/ae99007121121b28fbaab54f699658de5b249c4a/src/utils.ts#L113-L125 . But I'd be happy to be convinced I've missed something 🙂
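The scheme described above can be sketched as follows. This is a hypothetical illustration of the index math only, not the actual bloom-filters or bloomit code; h1 and h2 stand in for two real hash outputs (e.g. from xxhash):

```typescript
// Classic double hashing: simulate k hash functions from just two real
// hash values. g_i(x) = h1(x) + i * h2(x) mod m, so only two actual
// hash computations are needed no matter how large k is.
function getDoubleHashedIndices(
  h1: number, // first real hash of the input
  h2: number, // second real hash of the input
  k: number, // number of simulated hash functions
  m: number // output set size (number of cells)
): number[] {
  const indices: number[] = []
  for (let i = 1; i <= k; i++) {
    indices.push((h1 + i * h2) % m)
  }
  return indices
}
```

With k = 10 this costs two hash computations in total, versus ten for truly independent hashes, and versus twenty for an implementation that rehashes twice per index.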
Yes, you're right! We should only hash once for getting our indexes. In the commit I referenced, I fixed how indexes are generated. The point is that Bloom filters and equivalents don't need to set data at distinct indexes; among the different implementations we provide, only IBLTs require a set of distinct indexes. Just take a look at the new tests (bloom-filters/test/utils-test.js, lines 171 to 213 in 6e9cbcd).
Even after 1M iterations we had generated just 1 index out of the 7 desired. With a hash per iteration we need far fewer iterations to generate the 7 indexes. Your example with even numbers is exactly one of the problems we avoid with this tweak. BTW, for performance I suggest you modify your implementation following the code at lines 224 to 253 in 6e9cbcd.
Can I ask why you're using farmhash instead of xxhashjs? I did not find any interesting benchmark.
It's not a matter of "fewer iterations". You're getting into a cycle where h₂(x) == 0 mod 7; you could do 10^100 iterations and it would still never generate an index other than 4. (In theory anyway; in practice n can get high enough that n*h₂ starts approaching Number.MAX_SAFE_INTEGER and you start getting aliasing issues from putting integers in doubles, but that's hardly a good solution.) The reason you're getting into a cycle is that you're adding 1 to the second hash (4098376680), which gives 4098376681, which is divisible by 7. This breaks the requirement for double hashing that h₂(x) must be coprime with the size. You can solve this with e.g. the enhanced double hashing algorithm described in the Dillinger paper. (Which I've just realised I implemented wrongly in mine; I'll fix that.)
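The cycle described here can be reproduced directly, using the values from this thread (m = 7, second hash 4098376681, stuck index 4). A minimal demonstration, assuming a probe sequence of the plain h₁ + i·h₂ mod m form:

```typescript
// When h2 is divisible by m (i.e. not coprime with it), the probe
// sequence h1 + i*h2 mod m can never leave its starting cell, no
// matter how many iterations we run.
function probe(h1: number, h2: number, i: number, m: number): number {
  return (h1 + i * h2) % m
}

const m = 7
const h2 = 4098376681 // 4098376680 + 1, which happens to be divisible by 7
const stuck = h2 % m // 0: h2 contributes nothing modulo m

// Every probe collapses to the same index (4), for any iteration count.
const indices = [1, 2, 3, 100].map(i => probe(4, h2, i, m))
```

This is why rerunning the loop more times cannot help: the defect is in the probe sequence itself, not in the iteration budget.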
Again, if you're going to rehash on every iteration there's literally no point at all in doing double hashing. The only point of double hashing in this context is to avoid computing a new hash for every index, but you're doing that anyway.
Uninteresting reasons (we're already using farmhash for other things, so reusing the dependency).
Tweaked my double hashing implementation: https://github.com/ably-forks/bloomit/blob/9cacea1cd1637b60be486dd15e8703b48b160aaa/src/utils.ts#L106-L141 (that's on the xxhash version, for a PR to bloomit). It properly implements enhanced double hashing and rehashes at most once every size iterations (which should be almost never, only on pathological inputs).
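For reference, enhanced double hashing (from the Dillinger paper mentioned above) adds a cubic term, gᵢ(x) = h₁ + i·h₂ + (i³ − i)/6 mod m, which can be computed incrementally with two additions per index. A minimal sketch of that incremental form, separate from the linked implementation:

```typescript
// Enhanced double hashing, incremental form: each step adds b to the
// index accumulator a, then increments b's stride by i, which yields
// g_i = h1 + i*h2 + (i^3 - i)/6 mod m overall.
function enhancedDoubleHash(h1: number, h2: number, k: number, m: number): number[] {
  let a = h1 % m
  let b = h2 % m
  const indices = [a]
  for (let i = 1; i < k; i++) {
    a = (a + b) % m
    b = (b + i) % m
    indices.push(a)
  }
  return indices
}
```

Note that even when h₂ ≡ 0 mod m (the pathological case discussed above), the growing stride moves the sequence off its starting cell after the second index, rather than cycling forever.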
Ahah 👍 we worked on exactly the same thing but in different ways, and I'm actually surprised by yours. I agree, this will almost never happen because the number of hash functions is not very high in practice. I mean, I've never seen someone set a hashCount of 1000. But just in case, it will work.
Sure, go for it |
Hi. I'm just looking into the hashing implementation here and I'm baffled by several aspects of the implementation, hoping you can tell me if there's anything I'm missing.
For example, in getIndices, you're rehashing the input on every iteration of the loop (bloom-filters/src/utils.ts, lines 206 to 209 in 5c81fa4).
Surely this defeats the point of double hashing, which is to simulate k independent hash functions given only two real ones? Not using double hashing at all would need one hash per index; your implementation does two hashes per index.
It's true that the hashes you're calculating on each loop aren't quite the same, because you're adding size % i (that's the number of cells modulo the loop iteration) to the seed each time. But why? That seems like a really strange thing to add to the seed. It doesn't guarantee that the seed is different on each loop (e.g. if the number of cells is even it'll be 0 for the first 2 iterations). But again, that's not something you should want or need anyway in double hashing. Am I missing something?
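The seed-collision concern is easy to check. A small illustration (the function name is hypothetical; only the size % i expression comes from the code under discussion):

```typescript
// Computes the seed offsets size % i for iterations i = 1..k.
// With an even size, the offset is 0 for both i = 1 (always, since
// n % 1 == 0) and i = 2, so the first two seeds collide.
function seedOffsets(size: number, k: number): number[] {
  const offsets: number[] = []
  for (let i = 1; i <= k; i++) {
    offsets.push(size % i)
  }
  return offsets
}

const offsets = seedOffsets(1024, 4)
```

So for a filter with 1024 cells, iterations 1 and 2 would hash with identical seeds.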