Conversation

@jshmchenxi
Contributor

Split #2582 into several PRs.
This part adds support for writing Parquet bloom filters.

Adds 3 new TableProperties entries. The definitions are similar to those in apache/parquet-mr:

| Property | Default | Description |
| --- | --- | --- |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, used to compute the optimal size of the bloom filter. If this property is not set, the bloom filter uses the maximum size. If this property is set for a column, there is no need to also enable the bloom filter with `write.parquet.bloom-filter-enabled`. For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` enables a bloom filter for `some_column` sized for 200 expected distinct values |
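For example, these can be set through Iceberg's table API (a sketch; the catalog lookup and the column names some_column and another_column are illustrative):

Table table = catalog.loadTable(TableIdentifier.of("db", "events")); // assumed table

// Enable bloom filters for all columns except some_column, and size the
// filter on another_column for roughly 200 distinct values
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled", "true")
    .set("write.parquet.bloom-filter-enabled#some_column", "false")
    .set("write.parquet.bloom-filter-expected-ndv#another_column", "200")
    .commit();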

public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
  return meta.getBloomFilterOffset() == -1;
}
Member

Since getBloomFilterOffset is marked as a private method, I would prefer not to use it in production code, in case it breaks compatibility when upgrading the Parquet version in the future.

Is there any other approach to check whether the bloom filter binary was generated or not?

Contributor Author
@jshmchenxi jshmchenxi Jul 12, 2021

I did some searching and didn't find another approach. ColumnChunkMetaData#getBloomFilterOffset is also used when reading a bloom filter to judge whether it is enabled; see ParquetFileReader#readBloomFilter.
However, I should change == -1 to < 0.
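The revised helper would then be (a sketch of the fix discussed above):

public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
  // parquet-mr records a negative offset when no bloom filter was written
  return meta.getBloomFilterOffset() < 0;
}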

- private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-     .builder(PageWriteStore.class)
+ private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+     .builder(ColumnChunkPageWriteStore.class)
Member

The ColumnChunkPageWriteStore is also marked as InterfaceAudience.Private in the parquet-mr project, so we should not depend on this private interface explicitly in Java code, in case of upgrade issues.

I can understand that we changed it from PageWriteStore to ColumnChunkPageWriteStore because we need to pass a BloomFilterWriteStore to the following newColumnWriteStore. Maybe we could express this logic another way:

Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
    "pageStore must be an instance of BloomFilterWriteStore");
this.writeStore = props.newColumnWriteStore(parquetSchema, pageStore, (BloomFilterWriteStore) pageStore);

Member

More context: I think we used a DynConstructors.Ctor before because we were trying to avoid a code dependency on the private ColumnChunkPageWriteStore, in preparation for future Parquet version upgrades.

Contributor Author

Yes, ColumnChunkPageWriteStore used to be a package-private class and had to be created by reflection. I didn't pay much attention to the InterfaceAudience.Private annotation; thanks for pointing it out.

@jshmchenxi jshmchenxi force-pushed the bloom-filter-write branch from c8a6a1f to 61de602 on July 12, 2021 15:24
@huaxingao
Contributor

@jshmchenxi
Are you still working on this PR? Could you please update the PR so it can be reviewed? Thanks a lot!

@huaxingao
Contributor

@jshmchenxi
Do you still have time to work on this PR? If not, is it OK with you if I or someone else submits a new PR based on your work? We will list you as co-author and you will get commit credit. Please let me know if this is OK with you. Thank you very much!

@ConeyLiu
Contributor

ConeyLiu commented Nov 2, 2021

Hi @jshmchenxi, thanks for the contribution. We have some updates based on your work. Would you mind if we submit another PR based on it?

@jshmchenxi
Contributor Author

jshmchenxi commented Nov 3, 2021

Sorry for the late reply, I was busy with something else. I've rebased the code; please help review it. I have time to complete the feature now. Thanks for reminding me! @huaxingao @ConeyLiu

@huaxingao
Contributor

Thanks for rebasing the code! @jshmchenxi

cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!

public static final String AVRO_COMPRESSION_DEFAULT = "gzip";

public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;
Contributor
@flyrain flyrain Nov 9, 2021

What's the perf impact of writing the bloom filter? Does it make sense to enable it by default if the perf impact is minor? It would be nice to include benchmarks.

Contributor Author

Hi Yufei, thanks for the review. The performance impact of writing bloom filters should be negligible, though we didn't run a benchmark. The cost of bloom filters is space: the default size for one column is 1 MB per Parquet file, so enabling bloom filters for all N columns of a table adds N MB to each file. It is more reasonable to enable bloom filters only for columns that are high-cardinality and often used in filter expressions.
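For intuition, the textbook sizing formula m = -n * ln(p) / (ln 2)^2 shows how the expected NDV drives the filter size (a sketch; parquet-mr's split-block filters follow the same asymptotics, and the 0.01 false-positive rate is an assumption for illustration):

// Approximate bloom filter size in bytes for n distinct values at a target
// false-positive probability p, using the textbook formula; illustrative only
static long approxBloomFilterBytes(long n, double p) {
  double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
  return (long) Math.ceil(bits / 8);
}
// approxBloomFilterBytes(200, 0.01)       -> ~240 bytes
// approxBloomFilterBytes(1_000_000, 0.01) -> ~1.2 MB, above the 1 MB default cap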

Contributor

Thanks for the explanation. It makes sense to disable it by default. Is there any perf test from the Parquet or Spark side we can refer to?

Contributor Author

Yes, I found that SPARK-35345 is adding a perf test for the Parquet bloom filter with Spark. And the author @huaxingao happens to be in this PR 😄

Contributor

I will update the Parquet bloom filter benchmark PR so it can be merged.

Contributor
@flyrain flyrain left a comment

The patch looks good to me overall. It'd be nice to have a benchmark to understand its perf impact. I'm OK if this is planned and will be done in another PR.

Comment on lines 74 to 67
for (Map.Entry<String, String> entry : conf.entrySet()) {
  for (ConfigHelper<?> helper : helpers) {
    // Retrieve the value through Configuration's getters rather than parsing
    // the raw string, so Configuration's exact parsing rules apply
    helper.processKey(entry.getKey());
  }
}
Contributor

Extract it to a function parseConfig(Iterable<Map.Entry<String, String>> entrySet) to avoid duplication?

Contributor Author

Good point.
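Something like this, assuming both call sites loop over the same helpers list (a sketch of the extraction, not the final code):

private void parseConfig(Iterable<Map.Entry<String, String>> entrySet) {
  for (Map.Entry<String, String> entry : entrySet) {
    for (ConfigHelper<?> helper : helpers) {
      // Only the key is passed; values are resolved through Configuration's
      // own getters so its exact parsing rules apply
      helper.processKey(entry.getKey());
    }
  }
}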

Comment on lines 201 to 202
Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
"pageStore must be an instance of BloomFilterWriteStore");
Contributor
@flyrain flyrain Nov 9, 2021

Do we need this check? It throws a ClassCastException at line 205 anyway if the type doesn't match.

Contributor Author

The code is more readable with the check here.

Member

I'm a little confused about how the type might not match here. We get the class via reflection, so is there a chance of getting the wrong class?

Seems like we may benefit from moving this into the DynConstructors code at the top of the file? I may be misunderstanding, but it seems like we should do this check in the reflection itself rather than after we instantiate.

Contributor Author

Yes, the check is redundant here. I'll remove it.

> Seems like we may benefit from moving this into the DynConstructors code at the top of the file?

We are checking the constructed instance against an interface type. Maybe it would help if we added such a constraint to DynConstructors.

| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
Contributor

Minor suggestion, how about this?

"Enables or disables writing bloom filters for all columns by default. Individual columns can be enabled or disabled by specifying the column name. For example, ..."

Contributor

BTW, how does a user specify multiple columns? like this or something else?

write.parquet.bloom-filter-enabled#column1=false
write.parquet.bloom-filter-enabled#column2=false

Contributor Author

Yes, the format follows the properties defined in parquet-hadoop, like parquet.bloom.filter.enabled and parquet.bloom.filter.expected.ndv.
However, per Ryan's comment, I will change the property pattern to write.parquet.bloom-filter.col1.enabled.

Contributor

OK, that requires changes to the ColumnConfigParser class, since the "#" isn't used anymore.

import org.apache.hadoop.conf.Configuration;

/**
* TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, this class should be replaced with it.
 */
Member

Is there any plan for actually making this public?

Member

I think I agree with Ryan's comments that we should strive to keep this as similar to the way we set up Parquet metrics as possible. I know this would actually affect the writer, while our other properties affect the metrics level, but I think it makes sense to keep all the configurations alike.

Contributor Author

Thanks for the review. I've updated the configurations to be similar to metrics.


ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));
Member

I think we should add a few negative tests as well, since all of these check for inclusion and none for exclusion. I'm not super worried about this since I assume this code path is already well tested in Parquet.

Contributor Author

The hash values might collide, so I didn't add exclusion tests.
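For reference, a negative check would look roughly like this; a false positive on the chosen value could make it flaky (999 is an arbitrary value assumed absent from the test file):

// Bloom filters guarantee no false negatives but allow false positives,
// so this assertion can flake for an unlucky value
Assert.assertFalse("Bloom filter should not report 999 as present",
    intBloomFilter.findHash(intBloomFilter.hash(999)));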

| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, used to compute the optimal size of the bloom filter. If this property is not set, the bloom filter uses the maximum size. If this property is set for a column, there is no need to also enable the bloom filter with the `write.parquet.bloom-filter-enabled` property. For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` enables a bloom filter for `some_column` sized for 200 expected distinct values |
Member

A few minor suggestions:
"The expected number of distinct values in a column; used to compute the optimal size in bytes of the bloom filter. Note that if this property is not set, the bloom filter will use the maximum size set in `write.parquet.bloom-filter-max-bytes`."

"This property overrides `write.parquet.bloom-filter-enabled`, automatically enabling bloom filters for any columns specified." Then the example?

Contributor Author

Got it!

Contributor

Hi everyone, does anybody know how to set these configurations when using Trino with Iceberg? I have searched many places but haven't found any useful info.

Contributor

Did you check the Trino source code? Cc @jackye1995

Contributor Author

You would have to do a similar adaptation in Trino. @kingeasternsun

Member

Trino uses completely different Parquet readers, so the properties don't all translate over. Basically, any changes we make in Iceberg's Parquet readers will not affect Trino.

@jshmchenxi jshmchenxi force-pushed the bloom-filter-write branch 2 times, most recently from 8752f59 to 63c7acf on November 21, 2021 07:51
.withMaxBloomFilterBytes(bloomFilterMaxBytes)
.withBloomFilterEnabled(bloomFilterEnabled);

new ColumnConfigParser()
Contributor

Looks a bit weird; is this intended, a new instance without any reference to it?

Contributor Author
@jshmchenxi jshmchenxi Feb 9, 2022

This ColumnConfigParser instance is created to parse the column configs and apply them to propsBuilder.

Contributor

I know it's probably not a big deal, but could creating new instances be avoided?

Contributor Author

The parser is designed to be used this way, just like we need to create builder instances to build objects.
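For reference, the fluent usage looks roughly like this (a sketch following parquet-mr's ColumnConfigParser shape; the constant names and the conf/propsBuilder variables are assumed from the surrounding writer code):

new ColumnConfigParser()
    .withColumnConfig(PARQUET_BLOOM_FILTER_ENABLED,
        key -> conf.getBoolean(key, bloomFilterEnabled),
        propsBuilder::withBloomFilterEnabled)
    .withColumnConfig(PARQUET_BLOOM_FILTER_EXPECTED_NDV,
        key -> conf.getLong(key, -1L),
        propsBuilder::withBloomFilterNDV)
    .parseConfig(conf);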

@jshmchenxi
Contributor Author

Closing in favor of #4831.

@jshmchenxi jshmchenxi closed this May 21, 2022
