Add support for Parquet BloomFilter #2582
Conversation
@chenjunjiedada Hi, would you please help review this patch? Thanks!
```java
optional(12, "_all_nans", DoubleType.get()),
optional(13, "_some_nans", FloatType.get()),
optional(14, "_no_nans", DoubleType.get()),
optional(15, "_struct_not_null", _structFieldType),
```
Can you please add a test for this column using an in or equals predicate? It should ensure that a query against a field the bloom filter doesn't support won't throw when the file has a bloom filter.
OK. I'll try later. Thanks for the review!
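A minimal sketch of the requested test, assuming the ParquetBloomRowGroupFilter API from this PR and fixtures (parquetSchema, rowGroupMetadata, bloomStore) like those in the existing filter tests; the inner field name is hypothetical:

```java
import org.apache.iceberg.expressions.Expressions;
import org.junit.Assert;
import org.junit.Test;

@Test
public void testStructColumnWithBloomFilter() {
  // An eq predicate on a column the bloom filter cannot handle must not
  // throw, and must keep the row group ("might match") rather than prune it.
  boolean shouldRead =
      new ParquetBloomRowGroupFilter(SCHEMA, Expressions.equal("_struct_not_null.int_field", 30))
          .shouldRead(parquetSchema, rowGroupMetadata, bloomStore);
  Assert.assertTrue("Should read: cannot prune with an unsupported column type", shouldRead);
}
```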
I think we will need a fully covering test that addresses all the data types from Type.java; the selected Integer, Double, String, and Float are not enough.
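A sketch of what that coverage could look like, looping over the schema's columns; sampleValue is a hypothetical helper that returns a value of the right type that was actually written to the file:

```java
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.types.Types;
import org.junit.Assert;

for (Types.NestedField field : SCHEMA.columns()) {
  Object sample = sampleValue(field.type()); // hypothetical: a written value of this type
  // Written values must never be pruned (bloom filters have no false
  // negatives), and unsupported types must fall back to "might match"
  // instead of throwing.
  boolean shouldRead =
      new ParquetBloomRowGroupFilter(SCHEMA, Expressions.equal(field.name(), sample))
          .shouldRead(parquetSchema, rowGroupMetadata, bloomStore);
  Assert.assertTrue("No type should cause a false prune or an exception", shouldRead);
}
```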
@jshmchenxi, I think this should be done in several PRs instead of one. First, we would need to update the Parquet version, then we would want to add read support, and finally we would add write support. That will help keep the changes to a size where reviewers can get through them in a reasonable amount of time. I also think that we need to more carefully consider how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?
Agreed on Parquet versions. With the number of supported Spark versions, it would be difficult to bring in Parquet 1.12 (as great as it is) without some consideration by major stakeholders.
| Property | Default | Description |
| --- | --- | --- |
| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters; it can also be enabled for specific columns by appending # and the column name to the property |
> I also think that we need to more carefully consider how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?
@rdblue Yes, both write.parquet.bloom-filter-enabled and write.parquet.bloom-filter-expected-ndv support column-specific settings. We can set write.parquet.bloom-filter-enabled#user_id=true and write.parquet.bloom-filter-expected-ndv#user_id=1000 to enable the bloom filter only for the user_id column, with an expected NDV of 1000.
I'll make the docs more complete in new PRs.
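As a sketch, using the property names from this PR (the # delimiter is revisited below), enabling the filter for one column through the Java API might look like:

```java
import org.apache.iceberg.Table;

// Sketch: enable a bloom filter only for the user_id column with the
// column-specific '#' syntax proposed in this PR.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled#user_id", "true")       // per-column enable
    .set("write.parquet.bloom-filter-expected-ndv#user_id", "1000")  // expected distinct values
    .commit();
```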
```java
 * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
 * Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
 */
class ColumnConfigParser {
```
Iceberg doesn't use the same names that Parquet would, and it also doesn't use a Configuration to store properties. We need to think about what would make sense for Iceberg here, and using # to delimit properties is probably too confusing.
I think that the properties proposed in this PR for global defaults make sense, like write.parquet.bloom-filter-enabled, although the NDV default is probably not useful given that we expect NDV to vary widely across fields. For the column-specific settings, I think we may want to follow the same pattern that is used by metrics collection. That embeds the column name in the property, like write.metadata.metrics.column.col1. This could be write.parquet.bloom-filter.col1.enabled or write.parquet.bloom-filter.col1.max-bytes.
Ok, I will try to change the configuration pattern when I have time.
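A sketch of how the metrics-style naming suggested above could be resolved from table properties; the write.parquet.bloom-filter.&lt;column&gt;.enabled layout is the proposal from this thread, not a finalized API:

```java
import java.util.Map;

// Sketch: resolve a per-column setting with a global fallback, following
// the write.metadata.metrics.column.<name> naming pattern.
static boolean bloomFilterEnabled(Map<String, String> props, String column) {
  String columnKey = "write.parquet.bloom-filter." + column + ".enabled";
  String globalDefault = props.getOrDefault("write.parquet.bloom-filter-enabled", "false");
  return Boolean.parseBoolean(props.getOrDefault(columnKey, globalDefault));
}
```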
```
  this.recordCount = 0;

- PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
+ ColumnChunkPageWriteStore pageStore = pageStoreCtorParquet.newInstance(
```
Why are there write-side changes in this PR?
```java
    .build()) {
  GenericRecordBuilder builder = new GenericRecordBuilder(convert(FILE_SCHEMA, "table"));
  // create 50 records
  for (int i = 0; i < INT_VALUE_COUNT; i += 1) {
```
We usually use org.apache.iceberg.data.RandomGenericData#generate to produce random Records for testing, because it covers almost all the corner cases encountered in real production (actually, I found several bugs when I used RandomGenericData to mock data and ran unit tests). I think we could also use it here. For example, we could generate several records into a collection and then check that all the values from the given column test positive in the Parquet bloom filter binary.
@openinx Thanks! That's a good idea. I'll try using the RandomGenericData utility to generate test cases.
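A sketch of that approach; writeRecords and bloomFilterFor are hypothetical stand-ins for the test plumbing, while RandomGenericData.generate and the parquet-mr hash/findHash calls follow the real APIs:

```java
import java.util.List;
import org.apache.iceberg.data.RandomGenericData;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.junit.Assert;

// Sketch: every value written to the "id" column must test positive in
// that column's bloom filter (bloom filters have no false negatives).
List<Record> records = Lists.newArrayList(RandomGenericData.generate(SCHEMA, 100, 0L));
writeRecords(records);                      // hypothetical: write a Parquet file
BloomFilter filter = bloomFilterFor("id");  // hypothetical: read the column's filter back
for (Record record : records) {
  long id = (long) record.getField("id");
  Assert.assertTrue(filter.findHash(filter.hash(id)));
}
```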
Closing in favor of #4831.
For #2391, add Parquet BloomFilter support to Iceberg.
Upgrade Parquet version to 1.12.0 and add ParquetBloomRowGroupFilter similar to ParquetDictionaryRowGroupFilter.
The ExpressionVisitor is implemented with reference to org.apache.parquet.filter2.bloomfilterlevel.BloomFilterImpl.
The bloom filter is helpful only with eq() and in() expressions; it cannot help filter rows for other expressions like gt() or notEq().
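The reason, in brief: a bloom filter can prove a value is definitely absent, never that one is present, so only point lookups can prune. A minimal sketch against the parquet-mr BloomFilter interface:

```java
import org.apache.parquet.column.values.bloomfilter.BloomFilter;

// Sketch: eq(col, v) can skip a row group only when findHash returns false
// ("definitely not present"); a true result means only "maybe present".
// Range predicates like gt() get no signal from hashed membership, and
// notEq() cannot prune because "maybe present" still allows other rows.
static boolean mightContain(BloomFilter filter, long value) {
  return filter.findHash(filter.hash(value));
}
```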
Adds 3 new properties to TableProperties; the definitions are similar to apache/parquet-mr's.