Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Has bloom filters a higher probability of collide when strings are similar #48

Closed
ghost opened this issue Dec 12, 2021 · 5 comments
Labels

Comments

@ghost
Copy link

ghost commented Dec 12, 2021

Because if it is the case, this really suits my need.

I'm trying to discover popular searches on a website. For this, I'm using TopK algorithm which is based on Bloom Filter hashing.

I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.

Thanks a lot @Callidon !!!

@ghost
Copy link
Author

ghost commented Dec 12, 2021

if not, how can I replace the Hash function used with another one more permissive

@Callidon
Copy link
Owner

Hi 👋
Thank you for your interest in our library!

I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.

Under the hood, we are using xxHash. Their website contains a full benchmark, including collision probabilities. You should be able to find all information you need there.

if not, how can I replace the Hash function used with another one more permissive

You can change the seed used by the hash functions, but I'm afraid we do not have an API for swapping the hash function. The reason is that all the data-structures implemented by our package are carefully designed around xxHash. However, you can override any classes and their methods to customize how values are hashed, but we don't provide support for this, as it is a very advanced usage of the library.

@ghost
Copy link
Author

ghost commented Dec 13, 2021

Hello @Callidon
Thank you so much for your reply.

Ok then I will not mess with the hash function itself. I will pre-run a similarity detection on my stream (which is a stream of strings).
Closing this.
Thanks!

@ghost ghost closed this as completed Dec 13, 2021
@folkvir
Copy link
Collaborator

folkvir commented Dec 13, 2021

In newer version (soon I hope) you'll have the possibility to override a function for using any custom hash functions as long as you return a value of type Number.

folkvir added a commit that referenced this issue Dec 13, 2021
@ghost
Copy link
Author

ghost commented Dec 13, 2021

Thanks @folkvir , I'll keep an eye. for now, I will be running a similarity String prior to feeding topK. Very close Strings would be converted to the same String.
My case is looking for top searches on a website by the way.

Many thanks again!

Callidon pushed a commit that referenced this issue Feb 18, 2022
Author: Arnausd Grall (@folkvir)

* update deps and add gts

* h64 only + fix #43 + gts

* mocha: parallel execution

* utils: rename *Indices to *Indexes, add better documentation

* tests: show difference when using getDistinctIndexes w/wo hashing at each iteration

* utils: rename *Indices to *Indexes

* update ref pdf of CMS

* double hashing: use the enhanced technique

* getDistinctIndexes: implement #43 suggestions + allow to switch to XXH 32/64 if needed

* update changelog

* add .DS_Store in the list of ignored files

* use typescript 4.X.X

* fix typo of #46

* fix error when setting 32/64 bits xxh functions, and allow for overriding the serialize function (#48)

* add code coverage

* lint

* update eslint ignore for faster parsing

* add branch update_outdated to the tested branches

* fix wrong github workflow branch name

* workflow: use 16.x and remove 10.x

* update README and remove useless deps

* tests: fix error with describe/it rules overriding data

* implemnting xor filter for the next release

* update all files to the new project standard, set v2.0.0

* xor filters are working

* testing the xor filter

* xor filter #29

* #29

* documentation

* code ql / tests on develop

* keywords

* fix workflow

* move all hash related functions to the BaseFilter class

* modify tests according to: 12417e7

* parallel mocha tests

* use lts/* with setup-node@v2

* update README.md

* add badge to the readme

* export BaseFilter in the entry

* badge refers to the last action build

* update compatibility readme

* update @types/node version to be the latest minor release

* prettier auto line endings

* update @types/node version to be the latest minor release

* #49 add compatibility with 1.3.4

* #49 compatibility for BloomFilters import only is working

* Use AutoExportable feature for simplfying the export process, bitsets size set to a multiple of bitwords

* create Hashing classes

* remove nyc, use eslint + prettier instead of gts, enforce type checking, make all propeties public, no readonly

* update yarn.lock

* ci: remove node 15

* fix missing dependency and fix eslint errors

* update readme

* fix conflicts with master, eslint is now a rule

* add typedoc-plugin-missing-exports and fix non-overrided doubleHashing function
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants