CF.SIZEFOR #4
Hi shahasim! Thanks for taking the time to write all the details. In the meantime, please double check this:
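(A sketch of the command in question; `test` stands in for the actual key name:)

```
CF.INIT test 4G
```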
This command will create a 4G filter with 1-byte-wide fingerprints. To create a filter with 4-byte FPs:
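(Again a sketch, with `test` as a placeholder key:)

```
CF.INIT test 4G 4
```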
Please let me know if that was a bug in your code or if you just forgot to add the last parameter when writing this issue. I'll let you know about CF.CAPACITY once I take a look. Ciao!
Thanks for getting back to me on this. I've checked the code, and I am definitely adding the FP size at the end of the init command; I just missed it when writing up the issue. Just so I understand: if I omit the FP size, the command should still create a filter for 1 billion elements, and the only issue will be that I get more collisions?

However, my problem is that I am getting "ERR too full" after just 16,000 entries :( I have also tried to create a filter with an 8G size, but still no luck. On that note, the documentation says the size can be anything between 1K and 8G. Is 8G inclusive, as in, can I create a filter with an 8G size, or does it have to be less than 8G?

Thanks,
So, I found the bug with CF.CAPACITY. It was in the README. Long story short, the documentation was out of sync with what the code intended to do (probably an early draft that I forgot to update once I changed direction in the code). The real signature for the command is the following:
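(A sketch of the signature; the size and fpsize arguments are assumed to match CF.INIT's:)

```
CF.CAPACITY size [fpsize]
```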
It's more or less the inverse of CF.SIZEFOR, and as such it doesn't involve any key.
A full explanation is a bit long (I tried to write it, then decided it's best to avoid throwing at you a giant wall of text), but you can test manually what happens with the filter.
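For example, along these lines (the key, hash, and fingerprint values are arbitrary placeholders): inserting the same hash/fingerprint pair over and over sends every copy to the same two candidate buckets, which eventually fill up:

```
CF.INIT test 1K
CF.ADD test 12345 7
CF.ADD test 12345 7
(repeat the same call a few more times)
(error) ERR too full
```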
So, going back to your comment: in Cuckoo, adding the same item twice is not a no-op as it is with Bloom. Cuckoo filters are organized in buckets that can get completely full. The filter is able to rearrange some fingerprints when that happens (hence the name Cuckoo), but that mechanism relies on the fingerprints not all being equal. The previous code is basically an artificial way to fill a section of the filter without giving it a chance to "rebalance" the fingerprints.
It's a bit embarrassing, but it seems you have stumbled upon two dumb mistakes on my part. After the outdated docs, you have now discovered that I uploaded an obsolete .so/.dylib as part of the latest release! I tried to reproduce your problem with 8G filters by compiling the code on my machine and couldn't. Then I downloaded the latest release, and there was the problem. I also suspect that the collision problem you're having is related to the fact that the release erroneously contains a build based on outdated code.

I just released v1.1.1. Would you mind trying it out and seeing if you still have trouble with either 8G filters or collisions? If you still have problems with collisions, please share your test routine so I can see what's going on. Thanks!
Thank you for the quick turnaround on this and the detailed explanation; it turns out I was also ignoring a key concept regarding cuckoo filters. I thought the "ERR too full" was only because I was running out of memory/capacity; however, it was partly down to collisions and race conditions in my code :-( I had chosen an FP size of 4 bytes to reduce false positives, which also means I should see fewer collisions than with the other FP sizes. I have downloaded the latest release (many thanks :-)) and have changed my code to handle the race conditions. I will stress test my code over the next couple of days and keep you posted on the progress. Thanks again :-)
Sounds great. If you want a suggestion for your testing suite:
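One way to set that up, sketched in Go (the file name, item names, and fingerprint derivation are illustrative assumptions; it uses the 64-bit FNV hashing mentioned in the issue):

```go
package main

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
)

// Preprocessing stage: hash the whole test corpus once and save the
// results to disk, so repeated test runs can skip the slow hashing step
// and just stream the file into CF.ADD calls.
func main() {
	out, err := os.Create("testdata.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := bufio.NewWriter(out)
	defer w.Flush()

	for i := 0; i < 1_000_000; i++ {
		item := fmt.Sprintf("item-%d", i)

		h := fnv.New64a()
		h.Write([]byte(item))
		hash := h.Sum64()

		// The fingerprint is computed client-side in this library;
		// deriving it from a second, independent hash of the same
		// item is just one placeholder approach.
		f := fnv.New32a()
		f.Write([]byte(item))
		fp := f.Sum32()

		fmt.Fprintf(w, "%d %d\n", int64(hash), fp)
	}
}
```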
This way you split the slow part of the testing into a preprocessing stage that you don't need to run multiple times, unless you want to generate different test data. Ciao!
@shahasim any update?
@kristoff-it, apologies for not responding to you earlier. Due to lack of time, we went with this implementation of the cuckoo filter: https://github.com/RedisBloom/RedisBloom. Mainly because of the function CF.ADDNX (add only if the item does not exist), which helped us reduce the engineering effort. It would be amazing if we could have something similar in your implementation.
Hi @shahasim, no worries :) Good that you found a way to move forward. I won't implement ADDNX for this library because I believe it's easy to misuse: ADDNX can't work (correctly) in conjunction with REMOVE. For example, if you have A1 and A2, two items that collide, this sequence of operations will break the filter:
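(A sketch of that sequence; the command and key names are illustrative, in the style of the RedisBloom-like ADDNX being discussed:)

```
ADDNX filter A1    -> A1's fingerprint is stored
ADDNX filter A2    -> no-op: A2's fingerprint matches A1's, so it looks already present
REM   filter A1    -> the shared fingerprint is removed
CHECK filter A2    -> "not present": a false negative, even though A2 was "added"
```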
This behavior is unavoidable given the way probabilistic data structures work. If you are fine with introducing false negatives in the filter, then you're absolutely welcome to do so, but I believe the average user prefers to be protected from such situations. Note how using ADD would have avoided this problem completely. In fact, it's also possible to emulate ADDNX with my library: it's just a combination of CHECK plus a conditional ADD, which can be implemented efficiently in a Lua script, but I believe it's a suitable behavior only for extremely rare use cases.
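For illustration, a sketch of that emulation driven from Go (the go-redis v9 client, key name, and hash/fingerprint values are placeholders; the 0/1 return conventions of CF.CHECK and CF.ADD are assumptions):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// The Lua script makes CHECK + conditional ADD a single atomic step,
// so no other client can interleave between the two calls.
const addNXScript = `
if redis.call('CF.CHECK', KEYS[1], ARGV[1], ARGV[2]) == 1 then
  return 0 -- already looks present (possibly a false positive): skip the ADD
end
redis.call('CF.ADD', KEYS[1], ARGV[1], ARGV[2])
return 1
`

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// hash and fingerprint are computed client-side in this library;
	// 12345 and 67 below are arbitrary placeholder values.
	added, err := rdb.Eval(ctx, addNXScript, []string{"myfilter"}, 12345, 67).Int()
	if err != nil {
		panic(err)
	}
	fmt.Println("added:", added)
}
```

As said above, this still carries the same false-negative hazard as any ADDNX once REMOVE is involved.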
I agree with what you are saying about similar strings causing false positives. But I still think the ADDNX functionality is good to have, with, of course, the caution and caveats that you mentioned. Your library already provides a lot of flexibility, and ADDNX would add more to it, but I'll leave that decision to you :) I wish I had more time to implement it using Lua.
Hi,
I wanted to understand the commands associated with the cuckoo filter; apologies, some of these questions might be very basic.
I am trying to create a filter that would hold 1 billion elements, with an FP size of 4 bytes.
I have used the following set of commands to estimate the size and create the filter:
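(Presumably along these lines, as reconstructed from the replies above; `test` is a placeholder key name. Note the missing FP-size argument on CF.INIT, which the first reply flags, and the key passed to CF.CAPACITY per the then-outdated docs:)

```
CF.SIZEFOR 1000000000 4
CF.INIT test 4G
CF.CAPACITY test
```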
I believe CF.CAPACITY should return 1 billion; however, I get a bad size error.
Similarly, when I start loading values into my test filter, I get "ERR too full" after 12,000 to 16,000 values. So I would like to know whether there is a limitation on the size of the hash, and whether the size of the hashed string should be factored in. I am using a 64-bit FNV hash to hash strings into numbers.
I am using the following setup:
- redis_version: 5.0.5
- programming language: golang
- redis module used: go-redis
- cuckoo filter version: libredis-cuckoofilter.so.1.1.0
- OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)
Please let me know if you need more info.
Thanks,