feat(storage): bloom filter ignore invalid keys #15637
Conversation
I think this feature might not be that reasonable (or maybe I haven't fully understood this PR). For a point query, a bloom filter being ineffective for one value on one block doesn't mean that `where c = 'y'` is ineffective, or that `where c = 'x'` on other blocks is also ineffective. We cannot assume that users will perform point queries on column c in a consistent pattern. I suggest considering a cache key of table_id + column_id + block_id + point_value; in other words, a block-level cache of the prune return value for a specific value in a specific column's bloom filter. However, I'm not sure whether introducing this cache is reasonable when there is already a bloom filter cache. Additionally, if users confirm that the bloom filter for a certain column is almost ineffective, it is recommended that they specify not to generate a bloom filter index for this column when creating the table.
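The block-level cache suggested in this comment could be sketched roughly as below. This is only an illustration of the proposed key shape: `PruneCacheKey`, `PruneResult`, `PruneResultCache`, and the field types are hypothetical names chosen for the sketch, not Databend's actual types, and the cache here is unbounded for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: one entry per (table, column, block, probed value).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct PruneCacheKey {
    pub table_id: u64,
    pub column_id: u32,
    pub block_id: String,
    pub point_value: String,
}

/// Outcome of probing one block's bloom filter for one value.
/// `Uncertain` follows the PR summary; `Pruned` is an assumed name.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PruneResult {
    /// The value is definitely absent, so the block can be skipped.
    Pruned,
    /// The filter cannot rule the value out, so the block must be read.
    Uncertain,
}

/// Unbounded in-memory cache of prune results (a real cache would need eviction).
#[derive(Default)]
pub struct PruneResultCache {
    entries: HashMap<PruneCacheKey, PruneResult>,
}

impl PruneResultCache {
    /// Looks up a previously recorded prune result for this exact key.
    pub fn get(&self, key: &PruneCacheKey) -> Option<PruneResult> {
        self.entries.get(key).copied()
    }

    /// Records the prune result observed for this key.
    pub fn put(&mut self, key: PruneCacheKey, result: PruneResult) {
        self.entries.insert(key, result);
    }
}
```

Because the probed value and the block are both part of the key, an `Uncertain` result recorded for `c = 'x'` on one block says nothing about `c = 'y'` or about other blocks, which is the point the comment raises.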
I would prefer to check at runtime whether a bloom filter index (or any other index) is performing poorly, based on its hits/scan ratio, and skip it if so. The cache approach seems complex and hard to extend to other indexes.
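A minimal sketch of the runtime check proposed here, assuming per-index counters are available. `IndexStats`, `min_samples`, and `min_hit_ratio` are hypothetical names introduced for illustration; the comment does not specify thresholds or where such counters would live.

```rust
/// Hypothetical per-index counters; not Databend's actual types.
#[derive(Debug, Default)]
pub struct IndexStats {
    /// Number of blocks for which the index was consulted.
    pub scanned: u64,
    /// Number of those blocks the index allowed the query to skip.
    pub pruned: u64,
}

impl IndexStats {
    /// Records one index probe and whether it pruned the block.
    pub fn record(&mut self, block_was_pruned: bool) {
        self.scanned += 1;
        if block_was_pruned {
            self.pruned += 1;
        }
    }

    /// Returns true once at least `min_samples` blocks have been probed
    /// and the fraction of pruned blocks is below `min_hit_ratio`.
    pub fn should_skip(&self, min_samples: u64, min_hit_ratio: f64) -> bool {
        if self.scanned < min_samples {
            return false;
        }
        (self.pruned as f64) / (self.scanned as f64) < min_hit_ratio
    }
}
```

Since the counters are independent of the index type, the same check could in principle be applied to indexes other than bloom filters, which is the scalability argument made in the comment.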
We added |
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
If a bloom filter for a column in a query returns `Uncertain`, meaning it cannot determine whether the query value exists in the block, that bloom filter can be considered invalid: it does not reduce the number of block reads, and reading the bloom filter itself adds query overhead. If there are too many such invalid bloom filters, query performance degrades.
This PR adds a new setting, `bloom_filter_ignore_invalid_key_ratio`, which represents the percentage of invalid bloom filters relative to the total number of bloom filters. The default value is 100, which means the feature is not enabled. Valid values are 70 to 99, representing 70% to 99%. If the percentage of invalid bloom filters for a column exceeds this ratio, the column's bloom filter is added to a cache and ignored in subsequent queries.
This PR also adds a new configuration item, `table_prune_bloom_filter_invalid_keys_count`, which represents the number of invalid bloom filters that can be cached. The default value is 256; if the number of cached entries exceeds this value, the least frequently used entry is evicted.
Tests
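One way to read the setting and the bounded cache described in the Summary is sketched below. It is not the PR's implementation: `should_ignore`, `IgnoredKeys`, and the string key format are hypothetical, and treating any value outside 70 to 99 as "disabled" is an assumption made for the sketch.

```rust
use std::collections::HashMap;

/// Decides whether a column's bloom filter should be ignored.
/// `ratio_setting` mirrors the described setting: 100 means disabled and
/// 70..=99 are thresholds. Other values are treated as disabled (assumption).
pub fn should_ignore(invalid: u64, total: u64, ratio_setting: u64) -> bool {
    if !(70..=99).contains(&ratio_setting) || total == 0 {
        return false;
    }
    // Integer form of: invalid / total > ratio_setting / 100.
    invalid * 100 > total * ratio_setting
}

/// Bounded set of ignored keys with least-frequently-used eviction.
pub struct IgnoredKeys {
    capacity: usize,
    uses: HashMap<String, u64>,
}

impl IgnoredKeys {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, uses: HashMap::new() }
    }

    /// Returns true if `key` is cached, counting the lookup as one use.
    pub fn contains(&mut self, key: &str) -> bool {
        match self.uses.get_mut(key) {
            Some(count) => {
                *count += 1;
                true
            }
            None => false,
        }
    }

    /// Inserts `key`, evicting the least used entry when the cache is full.
    pub fn insert(&mut self, key: &str) {
        if self.capacity == 0 || self.uses.contains_key(key) {
            return;
        }
        if self.uses.len() >= self.capacity {
            // Linear scan for the least used entry; acceptable for a sketch.
            let victim = self
                .uses
                .iter()
                .min_by_key(|(_, n)| **n)
                .map(|(k, _)| k.clone());
            if let Some(victim) = victim {
                self.uses.remove(&victim);
            }
        }
        self.uses.insert(key.to_string(), 0);
    }
}
```

With the described default of 256, the cache would be created as `IgnoredKeys::new(256)`. A column whose filters return `Uncertain` for 80 of 100 blocks would be ignored under a setting of 70, while a setting of 100 never ignores anything.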
Type of change