Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): bloom filter ignore invalid keys #15637

Closed
wants to merge 9 commits into from

Conversation

b41sh
Copy link
Member

@b41sh b41sh commented May 24, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

If a bloom filter for a column in a query returns Uncertain, which means that it cannot determine whether the query value exists in the block, such a bloom filter can be considered invalid because it does not reduce the number of block reads, but instead reads one more additional bloom filter, which brings additional query overhead. If there are too many such invalid bloom filters, it will lead to query performance degradation.

Thie PR add a new setting bloom_filter_ignore_invalid_key_ratio to represent the percentage of invalid bloom filters to the total bloom filters, the default value is 100, which means it is not enabled. Valid values are 70 -- 99, representing 70% to 99%. If the percentage of invalid bloom filtering exceeds this ratio, the bloom filtering of this column will be added to to cache and ignored in the following queries.

Add a new configuration item table_prune_bloom_filter_invalid_keys_count, which represents the number of invalid bloom filters that can be cached, the default value is 256, if the data of the cached value exceeds this value, the least frequently used value will be excluded.

  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label May 24, 2024
@b41sh b41sh force-pushed the feat-bloom-cache branch from 69d306d to 52c4882 Compare May 29, 2024 08:51
@b41sh b41sh requested review from sundy-li and dantengsky May 29, 2024 15:55
@b41sh b41sh marked this pull request as ready for review May 29, 2024 15:55
@dantengsky
Copy link
Member

dantengsky commented May 30, 2024

I think this feature might not be that reasonable (or maybe I haven't fully understood this PR).

for point query like select * from t where c = 'x', the way we detect the in-validness in this PR, can only indicate that for the value 'x' on certain blocks, bloom pruning is ineffective.

It doesn't mean that where c = 'y' is ineffective; or that where c = 'x' on other blocks is also ineffective.

We cannot assume that users will perform point queries on column c in a consistent pattern.

I suggest considering:

The cache key could be table_id + column_id + block_id + point_value; in other words, this corresponds to a block-level cache of the prune return value for a specific value in a specific column's bloom filter.

However, I'm not very sure if introducing this cache is reasonable when there is already a bloom filter cache.

Additionally:

If users confirm that the bloom filter for a certain column is almost ineffective, it is recommended that users specify not to generate a bloom filter index for this column when creating the table.

@BohuTANG
Copy link
Member

BohuTANG commented May 30, 2024

I prefer check a bloom filter index(or others indexs) if it is not good(hits/scan ratio) and skip it during runtime, the cache way seems complex and hard to scale to other index.
If we have an RFC and discussion is more better.

@b41sh b41sh marked this pull request as draft May 30, 2024 14:59
@b41sh b41sh added the ci-cloud Build docker image for cloud test label Jun 3, 2024
@everpcpc everpcpc added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 3, 2024
Copy link
Contributor

github-actions bot commented Jun 3, 2024

Docker Image for PR

  • tag: pr-15637-f37f7fa-1717408759

note: this image tag is only available for internal use,
please check the internal doc for more details.

@dantengsky dantengsky added the ci-benchmark Benchmark: run all test label Jun 5, 2024
Copy link
Contributor

github-actions bot commented Jun 5, 2024

Docker Image for PR

  • tag: pr-15637-eb940bb-1717588248

note: this image tag is only available for internal use,
please check the internal doc for more details.

@b41sh b41sh force-pushed the feat-bloom-cache branch from edda477 to 6f15a20 Compare June 12, 2024 03:39
@b41sh b41sh added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 12, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-15637-26c2220-1718167652

note: this image tag is only available for internal use,
please check the internal doc for more details.

@b41sh b41sh added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 13, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-15637-8c717e6-1718273230

note: this image tag is only available for internal use,
please check the internal doc for more details.

@b41sh b41sh added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Jun 13, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-15637-1759709-1718294858

note: this image tag is only available for internal use,
please check the internal doc for more details.

@b41sh
Copy link
Member Author

b41sh commented Jun 20, 2024

We added enable_bloom_filter_ignore_invalid_key setting, if it is enabled, 10% of the block will be selected as sample, and if the bloom filter of 70% of the block with a certain filter key is invalid, the remaining blocks will ignore this filter key. After testing, we found that if the bloom filter of the filter key in each block is invalid, it can have a more obvious optimisation effect. However, the performance of tpch is much worse when enable_bloom_filter_ignore_invalid_key is enabled. Since it is relatively rare that the bloom filter of each block is invalid, the effect of this PR is not very obvious, so we choose to close this PR and consider reopen it on in the future if necessary. For some bloom filters that are really invalid, we can ignore them by setting bloom_index_columns option when create table.

image

@b41sh b41sh closed this Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-benchmark Benchmark: run all test ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants