feat(storage): bloom filter ignore invalid keys #15637

b41sh · 2024-05-24T11:11:14Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

If a bloom filter for a column in a query returns Uncertain, which means that it cannot determine whether the query value exists in the block, such a bloom filter can be considered invalid because it does not reduce the number of block reads, but instead reads one more additional bloom filter, which brings additional query overhead. If there are too many such invalid bloom filters, it will lead to query performance degradation.

Thie PR add a new setting bloom_filter_ignore_invalid_key_ratio to represent the percentage of invalid bloom filters to the total bloom filters, the default value is 100, which means it is not enabled. Valid values are 70 -- 99, representing 70% to 99%. If the percentage of invalid bloom filtering exceeds this ratio, the bloom filtering of this column will be added to to cache and ignored in the following queries.

Add a new configuration item table_prune_bloom_filter_invalid_keys_count, which represents the number of invalid bloom filters that can be cached, the default value is 256, if the data of the cached value exceeds this value, the least frequently used value will be excluded.

Fixes #[Link the issue here]

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

This change is

dantengsky · 2024-05-30T09:26:20Z

I think this feature might not be that reasonable (or maybe I haven't fully understood this PR).

for point query like select * from t where c = 'x', the way we detect the in-validness in this PR, can only indicate that for the value 'x' on certain blocks, bloom pruning is ineffective.

It doesn't mean that where c = 'y' is ineffective; or that where c = 'x' on other blocks is also ineffective.

We cannot assume that users will perform point queries on column c in a consistent pattern.

I suggest considering:

The cache key could be table_id + column_id + block_id + point_value; in other words, this corresponds to a block-level cache of the prune return value for a specific value in a specific column's bloom filter.

However, I'm not very sure if introducing this cache is reasonable when there is already a bloom filter cache.

Additionally:

If users confirm that the bloom filter for a certain column is almost ineffective, it is recommended that users specify not to generate a bloom filter index for this column when creating the table.

BohuTANG · 2024-05-30T11:32:48Z

I prefer check a bloom filter index(or others indexs) if it is not good(hits/scan ratio) and skip it during runtime, the cache way seems complex and hard to scale to other index.
If we have an RFC and discussion is more better.

github-actions · 2024-06-03T10:00:30Z

Docker Image for PR

tag: pr-15637-f37f7fa-1717408759

note: this image tag is only available for internal use,
please check the internal doc for more details.

github-actions · 2024-06-05T11:52:05Z

Docker Image for PR

tag: pr-15637-eb940bb-1717588248

note: this image tag is only available for internal use,
please check the internal doc for more details.

github-actions · 2024-06-05T12:28:27Z

ClickBench Report

github-actions · 2024-06-12T04:48:28Z

Docker Image for PR

tag: pr-15637-26c2220-1718167652

note: this image tag is only available for internal use,
please check the internal doc for more details.

github-actions · 2024-06-12T05:26:25Z

ClickBench Report

github-actions · 2024-06-13T10:08:09Z

Docker Image for PR

tag: pr-15637-8c717e6-1718273230

note: this image tag is only available for internal use,
please check the internal doc for more details.

github-actions · 2024-06-13T10:44:53Z

ClickBench Report

github-actions · 2024-06-13T16:08:52Z

Docker Image for PR

tag: pr-15637-1759709-1718294858

note: this image tag is only available for internal use,
please check the internal doc for more details.

github-actions · 2024-06-13T16:48:20Z

ClickBench Report

b41sh · 2024-06-20T02:43:06Z

We added enable_bloom_filter_ignore_invalid_key setting, if it is enabled, 10% of the block will be selected as sample, and if the bloom filter of 70% of the block with a certain filter key is invalid, the remaining blocks will ignore this filter key. After testing, we found that if the bloom filter of the filter key in each block is invalid, it can have a more obvious optimisation effect. However, the performance of tpch is much worse when enable_bloom_filter_ignore_invalid_key is enabled. Since it is relatively rare that the bloom filter of each block is invalid, the effect of this PR is not very obvious, so we choose to close this PR and consider reopen it on in the future if necessary. For some bloom filters that are really invalid, we can ignore them by setting bloom_index_columns option when create table.

github-actions bot added the pr-feature this PR introduces a new feature to the codebase label May 24, 2024

b41sh force-pushed the feat-bloom-cache branch from 69d306d to 52c4882 Compare May 29, 2024 08:51

b41sh requested review from sundy-li and dantengsky May 29, 2024 15:55

b41sh marked this pull request as ready for review May 29, 2024 15:55

sundy-li approved these changes May 30, 2024

View reviewed changes

b41sh marked this pull request as draft May 30, 2024 14:59

b41sh added the ci-cloud Build docker image for cloud test label Jun 3, 2024

everpcpc added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 3, 2024

dantengsky added the ci-benchmark Benchmark: run all test label Jun 5, 2024

b41sh added 3 commits June 12, 2024 11:16

feat(storage): bloom filter ignore invalid keys

c2cbaed

fix tests

f8dfd91

use sample bloom filter

6f15a20

b41sh force-pushed the feat-bloom-cache branch from edda477 to 6f15a20 Compare June 12, 2024 03:39

b41sh added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 12, 2024

b41sh added 4 commits June 13, 2024 14:52

merge

41c4ea7

invalid keys as option

bd31d9d

fix

3a2ba7e

fix

7d6deb0

b41sh added 2 commits June 13, 2024 17:20

fix

85ddcda

Merge branch 'main' into feat-bloom-cache

a16beb3

b41sh added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 13, 2024

b41sh added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Jun 13, 2024

b41sh closed this Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(storage): bloom filter ignore invalid keys #15637

feat(storage): bloom filter ignore invalid keys #15637

b41sh commented May 24, 2024 •

edited

Loading

dantengsky commented May 30, 2024 •

edited

Loading

BohuTANG commented May 30, 2024 •

edited

Loading

github-actions bot commented Jun 3, 2024

github-actions bot commented Jun 5, 2024

github-actions bot commented Jun 5, 2024

github-actions bot commented Jun 12, 2024

github-actions bot commented Jun 12, 2024

github-actions bot commented Jun 13, 2024

github-actions bot commented Jun 13, 2024

github-actions bot commented Jun 13, 2024

github-actions bot commented Jun 13, 2024

b41sh commented Jun 20, 2024

feat(storage): bloom filter ignore invalid keys #15637

feat(storage): bloom filter ignore invalid keys #15637

Conversation

b41sh commented May 24, 2024 • edited Loading

Summary

Tests

Type of change

dantengsky commented May 30, 2024 • edited Loading

BohuTANG commented May 30, 2024 • edited Loading

github-actions bot commented Jun 3, 2024

Docker Image for PR

github-actions bot commented Jun 5, 2024

Docker Image for PR

github-actions bot commented Jun 5, 2024

ClickBench Report

github-actions bot commented Jun 12, 2024

Docker Image for PR

github-actions bot commented Jun 12, 2024

ClickBench Report

github-actions bot commented Jun 13, 2024

Docker Image for PR

github-actions bot commented Jun 13, 2024

ClickBench Report

github-actions bot commented Jun 13, 2024

Docker Image for PR

github-actions bot commented Jun 13, 2024

ClickBench Report

b41sh commented Jun 20, 2024

b41sh commented May 24, 2024 •

edited

Loading

dantengsky commented May 30, 2024 •

edited

Loading

BohuTANG commented May 30, 2024 •

edited

Loading