Split parquet bloom filter config and enable bloom filter on read by default #10306

Merged: 14 commits, May 2, 2024

lewiszlw
Member

Which issue does this PR close?

Closes #10299.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the "core" label (Core DataFusion crate) on Apr 30, 2024
github-actions bot added the "sqllogictest" label (SQL Logic Tests (.slt)) on Apr 30, 2024
Contributor

alamb left a comment

Thank you @lewiszlw -- I think this looks great 🙏

I had some suggestions on naming, comments, and tests. Let me know what you think.

datafusion/common/src/config.rs: 3 review threads (outdated, resolved)
@@ -515,7 +515,7 @@ statement ok
 CREATE EXTERNAL TABLE data_index_bloom_encoding_stats STORED AS PARQUET LOCATION '../../parquet-testing/data/data_index_bloom_encoding_stats.parquet';

 statement ok
-set datafusion.execution.parquet.bloom_filter_enabled=true;
+set datafusion.execution.parquet.bloom_filter_on_read_enabled=true;
Contributor

It seems like this test was meant to ensure we had end-to-end coverage, so now that we have switched the default, its meaning / coverage has changed.

Perhaps we can change it to:

  1. Verify the setting with show datafusion.execution.parquet.bloom_filter_on_read_enabled
  2. Add a second set of queries with set datafusion.execution.parquet.bloom_filter_on_read_enabled=false
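
The two steps suggested above could look roughly like the following sqllogictest fragment. This is only a sketch, not the test that was actually committed. It uses the option name bloom_filter_on_read as it appears in a later hunk of this thread (the comment above still uses the earlier bloom_filter_on_read_enabled name), it assumes the default is true as the PR title states, and it assumes show prints the option name followed by its value. The filter queries themselves are left as placeholders because their expected results are not shown in this conversation.

```sql
# Step 1: confirm the read-side default before relying on it.
# Assumed output layout: option name, then value.
query TT
show datafusion.execution.parquet.bloom_filter_on_read
----
datafusion.execution.parquet.bloom_filter_on_read true

# ... existing filter queries against data_index_bloom_encoding_stats go here ...

# Step 2: disable bloom filter use on read and repeat the same queries.
# A bloom filter only prunes row groups, so the query results should not change.
statement ok
set datafusion.execution.parquet.bloom_filter_on_read=false;

# ... the same filter queries again, with the same expected results ...
```

Running the same queries under both settings keeps end-to-end coverage of the bloom filter path while also checking that results are unaffected when it is turned off.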

Member Author

Done

Contributor

alamb left a comment

Awesome -- thank you @lewiszlw. This is a really nice improvement.

cc @hiltontj

 statement ok
-set datafusion.execution.parquet.bloom_filter_enabled=true;
+set datafusion.execution.parquet.bloom_filter_on_read=false;
Contributor

👍

Contributor

hiltontj commented May 1, 2024

Thank you @lewiszlw for taking this one on!

alamb added the "api change" label (Changes the API exposed to users of the crate) on May 1, 2024
alamb merged commit 6d77748 into apache:main on May 2, 2024
24 checks passed
Contributor

alamb commented May 2, 2024

Thanks again @lewiszlw

This pull request was closed.
Labels
api change: Changes the API exposed to users of the crate
core: Core DataFusion crate
sqllogictest: SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Enable bloom filters by default on read
3 participants