-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Question] Has bloom filters a higher probability of collide when strings are similar #48
Comments
if not, how can I replace the Hash function used with another one more permissive |
Hi 👋
Under the hood, we are using xxHash. Their website contains a full benchmark, including collision probabilities. You should be able to find all information you need there.
You can change the seed used by the hash functions, but I'm afraid we do not have an API for swapping the hash function. The reason is that all the data-structures implemented by our package are carefully designed around xxHash. However, you can override any classes and their methods to customize how values are hashed, but we don't provide support for this, as it is a very advanced usage of the library. |
Hello @Callidon Ok then I will not mess with the hash function itself. I will pre-run a similarity detection on my stream (which is a stream of strings). |
In newer version (soon I hope) you'll have the possibility to override a function for using any custom hash functions as long as you return a value of type |
Thanks @folkvir , I'll keep an eye. for now, I will be running a similarity String prior to feeding topK. Very close Strings would be converted to the same String. Many thanks again! |
Author: Arnausd Grall (@folkvir) * update deps and add gts * h64 only + fix #43 + gts * mocha: parallel execution * utils: rename *Indices to *Indexes, add better documentation * tests: show difference when using getDistinctIndexes w/wo hashing at each iteration * utils: rename *Indices to *Indexes * update ref pdf of CMS * double hashing: use the enhanced technique * getDistinctIndexes: implement #43 suggestions + allow to switch to XXH 32/64 if needed * update changelog * add .DS_Store in the list of ignored files * use typescript 4.X.X * fix typo of #46 * fix error when setting 32/64 bits xxh functions, and allow for overriding the serialize function (#48) * add code coverage * lint * update eslint ignore for faster parsing * add branch update_outdated to the tested branches * fix wrong github workflow branch name * workflow: use 16.x and remove 10.x * update README and remove useless deps * tests: fix error with describe/it rules overriding data * implemnting xor filter for the next release * update all files to the new project standard, set v2.0.0 * xor filters are working * testing the xor filter * xor filter #29 * #29 * documentation * code ql / tests on develop * keywords * fix workflow * move all hash related functions to the BaseFilter class * modify tests according to: 12417e7 * parallel mocha tests * use lts/* with setup-node@v2 * update README.md * add badge to the readme * export BaseFilter in the entry * badge refers to the last action build * update compatibility readme * update @types/node version to be the latest minor release * prettier auto line endings * update @types/node version to be the latest minor release * #49 add compatibility with 1.3.4 * #49 compatibility for BloomFilters import only is working * Use AutoExportable feature for simplfying the export process, bitsets size set to a multiple of bitwords * create Hashing classes * remove nyc, use eslint + prettier instead of gts, enforce type checking, make all propeties public, no readonly * update yarn.lock * ci: remove node 15 * fix missing dependency and fix eslint errors * update readme * fix conflicts with master, eslint is now a rule * add typedoc-plugin-missing-exports and fix non-overrided doubleHashing function
Because if it is the case, this really suits my need.
I'm trying to discover popular searches on a website. For this, I'm using TopK algorithm which is based on Bloom Filter hashing.
I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.
Thanks a lot @Callidon !!!
The text was updated successfully, but these errors were encountered: