Core: Support writing parquet bloom filter #2642
Conversation
parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
```java
public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
  return meta.getBloomFilterOffset() == -1;
}
```
As getBloomFilterOffset is marked as a private API, I would prefer not to use it in production code, in case it breaks compatibility when upgrading the Parquet version in the future.
Is there any other approach to check whether a bloom filter binary was generated?
I did some searching and didn't find another approach. ColumnChunkMetaData#getBloomFilterOffset is also used when reading the bloom filter to judge whether it is enabled; see ParquetFileReader#readBloomFilter.
However, I should change == -1 to < 0.
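A minimal sketch of the corrected check, keeping the helper name from the diff above (the enclosing class name here is hypothetical):

```java
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class BloomFilterOffsets {  // hypothetical enclosing class
  // parquet-mr leaves the bloom filter offset unset (negative) when no bloom
  // filter was written for the column chunk, so test for < 0 rather than == -1.
  public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
    return meta.getBloomFilterOffset() < 0;
  }
}
```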
```diff
-private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-    .builder(PageWriteStore.class)
+private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+    .builder(ColumnChunkPageWriteStore.class)
```
ColumnChunkPageWriteStore is also marked as InterfaceAudience.Private in the parquet-mr project, so we should not depend on this private interface explicitly in Java code, in case of upgrade issues.
I can understand that we changed it from PageWriteStore to ColumnChunkPageWriteStore because we need to pass a BloomFilterWriteStore to the following newColumnWriteStore call. Maybe we could express this logic differently:

```java
Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
    "pageStore must be an instance of BloomFilterWriteStore");
this.writeStore = props.newColumnWriteStore(parquetSchema, pageStore, (BloomFilterWriteStore) pageStore);
```
More context: I think we used a DynConstructors.Ctor before because we were trying to avoid a code dependency on the private ColumnChunkPageWriteStore and to prepare for future Parquet version upgrades.
Yes, ColumnChunkPageWriteStore used to be a package-private class and had to be created by reflection. I didn't pay much attention to the InterfaceAudience.Private annotation; thanks for pointing it out.
Hi @jshmchenxi, thanks for the contribution. We have some updates based on your work. Would you mind if we submit another PR based on it?
Sorry for the late reply, I was busy with something else. I've rebased the code; please help review it. I have time to complete the feature now. Thanks for the reminder! @huaxingao @ConeyLiu

Thanks for rebasing the code, @jshmchenxi! cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!
```java
public static final String AVRO_COMPRESSION_DEFAULT = "gzip";

public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;
```
What's the perf impact of writing the bloom filter? Does it make sense to enable it by default if the perf impact is minor? It would be nice to include benchmarks.
Hi Yufei, thanks for the review. The performance impact of writing the bloom filter should be negligible, though we didn't run a performance benchmark. The cost of the bloom filter is space: the default size of the bloom filter for one column is 1 MB in each Parquet file, so if there are N columns in the table, enabling the bloom filter for all columns adds N MB of extra space to each file. It is more reasonable to enable the bloom filter only for columns that are of high cardinality and often used in filter expressions.
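As a concrete illustration, enabling the filter for only one high-cardinality column with the property pattern from this PR could look like the sketch below (table and column names are made up; `table` is an already-loaded Iceberg table):

```java
import org.apache.iceberg.Table;

// Hypothetical column name; keeps the table-wide default off and opts in a
// single high-cardinality column via the `#column` suffix from this PR.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled", "false")
    .set("write.parquet.bloom-filter-enabled#user_id", "true")
    .commit();
```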
Thanks for the explanation. Makes sense to disable it by default. Is there any perf test from the Parquet or Spark side we can refer to?
Yes, I found that SPARK-35345 is adding a perf test for the Parquet bloom filter with Spark. And the author @huaxingao happens to be in this PR 😄
I will update the Parquet bloom filter benchmark PR so it can be merged.
flyrain left a comment:
The patch looks good to me overall. It'd be nice to have a benchmark to understand its perf impact. I'm OK if this is planned and will be done in another PR.
```java
for (Map.Entry<String, String> entry : conf.entrySet()) {
  for (ConfigHelper<?> helper : helpers) {
    // We retrieve the value from the function instead of parsing the string here
    // to use the exact implementations in Configuration
    helper.processKey(entry.getKey());
  }
}
```
Extract this to a function `parseConfig(Iterable<Map.Entry<String, String>> entrySet)` to avoid duplication?
Good point.
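A sketch of the suggested extraction, assuming the `helpers` field and `ConfigHelper#processKey` from the surrounding class:

```java
// Sketch only: `helpers` (a List<ConfigHelper<?>>) and ConfigHelper come from
// the enclosing class shown in the diff above.
private void parseConfig(Iterable<Map.Entry<String, String>> entrySet) {
  for (Map.Entry<String, String> entry : entrySet) {
    for (ConfigHelper<?> helper : helpers) {
      // Retrieve the value through the helper instead of parsing the raw string,
      // to reuse the exact implementations in Configuration.
      helper.processKey(entry.getKey());
    }
  }
}
```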
```java
Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
    "pageStore must be an instance of BloomFilterWriteStore");
```
Do we need this check? It throws a ClassCastException at line 205 anyway if the type doesn't match.
The code is more readable with the check here.
I'm a little confused about how the type might not match here. We get the class via reflection, so do we have a chance of getting the wrong class?
It seems like we may benefit from moving this into the DynConstructors code at the top of the file. I may misunderstand this, but it seems like we should do this check in the reflection itself rather than after we instantiate.
Yes, the check is redundant here. I'll remove it.

> Seems like we may benefit from moving this into the DynConstructors code at the top of the file?

We are checking the constructed instance against an interface type. Maybe it would help if we added that constraint to DynConstructors.
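A sketch of what that could look like, failing fast right after the reflective construction (the argument list is illustrative, not the PR's actual call):

```java
// Illustrative only: construct via reflection, then verify immediately that
// the parquet version in use still implements BloomFilterWriteStore, instead
// of relying on a later ClassCastException.
Object pageStore = pageStoreCtorParquet.newInstance(/* writer-specific args */);
Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
    "Expected a BloomFilterWriteStore but got %s", pageStore.getClass().getName());
this.writeStore = props.newColumnWriteStore(
    parquetSchema, (PageWriteStore) pageStore, (BloomFilterWriteStore) pageStore);
```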
site/docs/configuration.md
| write.parquet.dict-size-bytes      | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec    | gzip           | Parquet compression codec |
| write.parquet.compression-level    | null           | Parquet compression level |
| write.parquet.bloom-filter-enabled | false          | Whether to enable writing bloom filters; if true, the bloom filter is enabled for all columns, and if false it is disabled for all columns; it can also be overridden per column by appending `#` and the column name to the property; for example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables the bloom filter for all columns except `some_column` |
Minor suggestions. How about this?
To enable or disable bloom filter for all columns by default. Partial enabling and disabling are possible by specifying the column name. For example, ...
BTW, how does a user specify multiple columns? Like this, or something else?

```
write.parquet.bloom-filter-enabled#column1=false
write.parquet.bloom-filter-enabled#column2=false
```
Yes, the format follows the properties defined in parquet-hadoop, like parquet.bloom.filter.enabled and parquet.bloom.filter.expected.ndv.
However, following Ryan's comment, I will change the property pattern to `write.parquet.bloom-filter.col1.enabled`.
OK, that requires changes to the ColumnConfigParser class, since the "#" separator is no longer used.
```java
import org.apache.hadoop.conf.Configuration;

/**
 * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
 */
```
Is there any plan for actually making this public?
I think I agree with Ryan's comments that we should strive to keep this as similar as possible to the way we set up Parquet metrics. I know this would actually affect the writer, while our other properties affect the metrics level, but I think it makes sense to keep all the configurations alike.
Thanks for the review. I've updated the configurations to be similar to metrics.
```java
ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));
```
I think we should add a few negative tests as well, since all of these check for inclusion and none for exclusion. I'm not super worried about this since I assume this code path is already well tested in Parquet.
The hash values might collide, so I didn't add exclusion tests.
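For reference, an exclusion check would be a one-liner like the sketch below; since bloom filters can return false positives, the probe value (made up here) would need to be verified not to collide with the inserted values for this fixed test file:

```java
// Sketch: intBloomFilter comes from the test excerpt above; 9999 is a
// hypothetical probe value that was never written to the column and would
// have to be checked against this test data before relying on the assertion.
Assert.assertFalse(intBloomFilter.findHash(intBloomFilter.hash(9999)));
```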
site/docs/configuration.md
| write.parquet.compression-level         | null           | Parquet compression level |
| write.parquet.bloom-filter-enabled      | false          | Whether to enable writing bloom filters; if true, the bloom filter is enabled for all columns, and if false it is disabled for all columns; it can also be overridden per column by appending `#` and the column name to the property; for example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables the bloom filter for all columns except `some_column` |
| write.parquet.bloom-filter-max-bytes    | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.parquet.bloom-filter-expected-ndv | (not set)      | The expected number of distinct values in a column, used to compute the optimal size of the bloom filter; if not set, the bloom filter uses the maximum size; if set for a column, there is no need to also enable the bloom filter with `write.parquet.bloom-filter-enabled`; for example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` enables the bloom filter for `some_column` with an expected number of distinct values of 200 |
A few minor suggestions:

"The expected number of distinct values in a column; it is used to compute the optimal size in bytes of the bloom filter. Note that if this property is not set, the bloom filter will use the maximum size set in `bloom-filter-max-bytes`."

"This property overrides `write.parquet.bloom-filter-enabled`, automatically enabling bloom filters for any columns specified." Then the example?
Got it!
Hi everyone, does anybody know how to set these configurations when using Trino with Iceberg? I have searched many places but haven't found any useful info.
Did you check the Trino source code? cc @jackye1995
You would have to do a similar adaptation in Trino. @kingeasternsun
Trino uses completely different Parquet readers, so the properties don't all translate over. Basically, any changes we make in the Iceberg Parquet readers will not affect Trino.
```java
.withMaxBloomFilterBytes(bloomFilterMaxBytes)
.withBloomFilterEnabled(bloomFilterEnabled);

new ColumnConfigParser()
```
This looks a bit weird; is it intended to create a new instance without keeping a reference to it?
This ColumnConfigParser instance is created to parse the column configs and apply them to propsBuilder.
I know it's probably not a big deal, but could creating new instances be avoided?
The parser is designed to be used this way, just as we create builder instances to build objects.
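For reference, the intended fluent usage looks roughly like the sketch below, assuming the copied class mirrors parquet-mr's ColumnConfigParser API (`withColumnConfig(rootKey, readValue, setValue)` followed by `parseConfig(conf)`); the property constant and variable names follow the diffs above:

```java
// The parser instance is intentionally throwaway: it only routes per-column
// entries (e.g. write.parquet.bloom-filter-enabled#some_column) from the
// Hadoop Configuration into propsBuilder.
new ColumnConfigParser()
    .withColumnConfig(PARQUET_BLOOM_FILTER_ENABLED,
        key -> conf.getBoolean(key, bloomFilterEnabled),
        propsBuilder::withBloomFilterEnabled)
    .parseConfig(conf);
```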
Closed in favor of #4831.
Split #2582 into several PRs.
This part adds support for writing the Parquet bloom filter.
It adds 3 new table properties. The definitions are similar to apache/parquet-mr:

- `write.parquet.bloom-filter-enabled`: setting `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable the bloom filter for all columns except `some_column`
- `write.parquet.bloom-filter-max-bytes`: the maximum number of bytes for a bloom filter bitset
- `write.parquet.bloom-filter-expected-ndv`: setting this for a column enables its bloom filter without needing the `write.parquet.bloom-filter-enabled` property; for example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable the bloom filter for `some_column` with an expected number of distinct values of 200