[Question] Do Bloom filters have a higher probability of collision when strings are similar? #48

ghost · 2021-12-12T12:41:44Z

If that is the case, it would suit my needs well.

I'm trying to discover popular searches on a website. For this, I'm using the TopK algorithm, which is based on Bloom filter hashing.

I don't want "Hello world" and "hello world" to be counted twice. So if they collide, that would be ideal for my use case.

Thanks a lot @Callidon !!!

ghost · 2021-12-12T13:15:20Z

If not, how can I replace the hash function with a more permissive one?

Callidon · 2021-12-13T07:32:15Z

Hi 👋
Thank you for your interest in our library!

> I don't want "Hello world" and "hello world" to be counted twice. So if they collide, that would be ideal for my use case.

Under the hood, we are using xxHash. Its website contains a full benchmark, including collision probabilities. You should be able to find all the information you need there.

> If not, how can I replace the hash function with a more permissive one?

You can change the seed used by the hash functions, but I'm afraid we do not have an API for swapping the hash function. The reason is that all the data structures implemented by our package are carefully designed around xxHash. You can still override any class and its methods to customize how values are hashed, but we don't provide support for this, as it is a very advanced usage of the library.
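On the collision question: general-purpose hash functions are designed so that similar inputs produce unrelated outputs, so case variants of a string should not be expected to collide more often than any other pair. The sketch below illustrates this with a small FNV-1a implementation. FNV-1a is an assumption chosen only because it fits in a few lines; it is not xxHash and not the code this library uses.

```typescript
// FNV-1a (32-bit), used here only as a compact stand-in for a real hash
// such as xxHash. This is NOT the hash function used by the library.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i) & 0xff; // mix in one byte (ASCII assumed)
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

const upper = fnv1a32("Hello world");
const lower = fnv1a32("hello world");

// The two spellings hash to different values.
console.log(upper === lower); // false

// They only match once the input is normalized before hashing.
console.log(fnv1a32("Hello world".toLowerCase()) === lower); // true
```

If "Hello world" and "hello world" must count as one item, the practical route is to make the inputs identical before they are hashed rather than to rely on collisions.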

ghost · 2021-12-13T07:52:23Z

Hello @Callidon
Thank you so much for your reply.

OK, then I will not mess with the hash function itself. I will first run a similarity-detection pass over my stream (which is a stream of strings).
Closing this.
Thanks!

folkvir · 2021-12-13T15:18:43Z

In a newer version (soon, I hope), you'll be able to override a function to use any custom hash function, as long as it returns a value of type Number.


ghost · 2021-12-13T15:27:54Z

Thanks @folkvir, I'll keep an eye on it. For now, I will run a string-similarity step before feeding TopK, so that very close strings are converted to the same string.
My use case is finding the top searches on a website, by the way.

Many thanks again!
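A minimal sketch of the pre-processing step described above, assuming that "very close" means differences in case, surrounding whitespace, and Unicode compatibility forms. The names `normalizeQuery` and `countQueries` are hypothetical, and the plain `Map` is only a stand-in for the TopK structure so that the example stays self-contained.

```typescript
// Hypothetical helper: map trivially different queries to one canonical key.
function normalizeQuery(raw: string): string {
  return raw
    .normalize("NFKC") // unify Unicode compatibility forms
    .trim() // drop leading and trailing whitespace
    .toLowerCase() // ignore case differences
    .replace(/\s+/g, " "); // collapse runs of whitespace into one space
}

// Stand-in counter: a real pipeline would feed the normalized key to TopK
// instead of an exact Map.
function countQueries(queries: readonly string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const query of queries) {
    const key = normalizeQuery(query);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const counts = countQueries(["Hello world", "hello world", "  HELLO   world "]);
console.log(counts.get("hello world")); // 3
console.log(counts.size); // 1
```

Fuzzier matching, such as typo tolerance, would need an edit-distance or similar step on top of this and is not covered here.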

Author: Arnaud Grall (@folkvir)

* update deps and add gts
* h64 only + fix #43 + gts
* mocha: parallel execution
* utils: rename *Indices to *Indexes, add better documentation
* tests: show difference when using getDistinctIndexes w/wo hashing at each iteration
* utils: rename *Indices to *Indexes
* update ref pdf of CMS
* double hashing: use the enhanced technique
* getDistinctIndexes: implement #43 suggestions + allow to switch to XXH 32/64 if needed
* update changelog
* add .DS_Store in the list of ignored files
* use typescript 4.X.X
* fix typo of #46
* fix error when setting 32/64 bits xxh functions, and allow for overriding the serialize function (#48)
* add code coverage
* lint
* update eslint ignore for faster parsing
* add branch update_outdated to the tested branches
* fix wrong github workflow branch name
* workflow: use 16.x and remove 10.x
* update README and remove useless deps
* tests: fix error with describe/it rules overriding data
* implementing xor filter for the next release
* update all files to the new project standard, set v2.0.0
* xor filters are working
* testing the xor filter
* xor filter #29
* #29
* documentation
* code ql / tests on develop
* keywords
* fix workflow
* move all hash related functions to the BaseFilter class
* modify tests according to: 12417e7
* parallel mocha tests
* use lts/* with setup-node@v2
* update README.md
* add badge to the readme
* export BaseFilter in the entry
* badge refers to the last action build
* update compatibility readme
* update @types/node version to be the latest minor release
* prettier auto line endings
* update @types/node version to be the latest minor release
* #49 add compatibility with 1.3.4
* #49 compatibility for BloomFilters import only is working
* Use AutoExportable feature for simplifying the export process, bitsets size set to a multiple of bitwords
* create Hashing classes
* remove nyc, use eslint + prettier instead of gts, enforce type checking, make all properties public, no readonly
* update yarn.lock
* ci: remove node 15
* fix missing dependency and fix eslint errors
* update readme
* fix conflicts with master, eslint is now a rule
* add typedoc-plugin-missing-exports and fix non-overrided doubleHashing function

Callidon added the question label Dec 13, 2021

ghost closed this as completed Dec 13, 2021

folkvir added a commit that referenced this issue Dec 13, 2021

fix error when setting 32/64 bits xxh functions, and allow for overriding the serialize function (#48) · 9358c94
