Conversation

@jshmchenxi
Contributor

Split #2582 into several PRs.
This part adds support for writing Parquet bloom filters.

Adds 3 new TableProperties entries. The definitions are similar to those in apache/parquet-mr:

| Property | Default | Description |
| --- | --- | --- |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, used to compute the optimal size of the bloom filter. If this property is not set, the bloom filter uses the maximum size. If this property is set for a column, there is no need to also enable the bloom filter with `write.parquet.bloom-filter-enabled`. For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` enables a bloom filter for `some_column` sized for 200 expected distinct values |
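For example, these can be set through Iceberg's table API (a sketch; the catalog lookup and the column names some_column and another_column are illustrative):

Table table = catalog.loadTable(TableIdentifier.of("db", "events")); // assumed table

// Enable bloom filters for all columns except some_column, and size the
// filter on another_column for roughly 200 distinct values
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled", "true")
    .set("write.parquet.bloom-filter-enabled#some_column", "false")
    .set("write.parquet.bloom-filter-expected-ndv#another_column", "200")
    .commit();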

public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
  return meta.getBloomFilterOffset() == -1;
}
Member

Since getBloomFilterOffset is marked as a private method, I would prefer not to use it in production code, in case it breaks compatibility when upgrading the Parquet version in the future.

Is there any other approach to check whether the bloom filter binary was generated or not?

Contributor Author
@jshmchenxi jshmchenxi Jul 12, 2021

I did some searching and didn't find another approach. ColumnChunkMetaData#getBloomFilterOffset is also used when reading a bloom filter to judge whether it is enabled; see ParquetFileReader#readBloomFilter.
However, I should change == -1 to < 0.
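The revised helper would then be (a sketch of the fix discussed above):

public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
  // parquet-mr records a negative offset when no bloom filter was written
  return meta.getBloomFilterOffset() < 0;
}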

- private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-     .builder(PageWriteStore.class)
+ private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+     .builder(ColumnChunkPageWriteStore.class)
Member

The ColumnChunkPageWriteStore is also marked as InterfaceAudience.Private in the parquet-mr project, so we should not depend on this private interface explicitly in Java code, in case of upgrade issues.

I can understand that we changed it from PageWriteStore to ColumnChunkPageWriteStore because we need to pass a BloomFilterWriteStore to the following newColumnWriteStore. Maybe we could express this logic another way:

Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
    "pageStore must be an instance of BloomFilterWriteStore");
this.writeStore = props.newColumnWriteStore(parquetSchema, pageStore, (BloomFilterWriteStore) pageStore);

Member

More context: I think we used a DynConstructors.Ctor before because we were trying to avoid a code dependency on the private ColumnChunkPageWriteStore, in preparation for future Parquet version upgrades.

Contributor Author

Yes, ColumnChunkPageWriteStore used to be a package-private class and had to be created by reflection. I didn't pay much attention to the InterfaceAudience.Private annotation; thanks for pointing it out.

@jshmchenxi jshmchenxi force-pushed the bloom-filter-write branch from c8a6a1f to 61de602 on July 12, 2021 15:24
@huaxingao
Contributor

@jshmchenxi
Are you still working on this PR? Could you please update the PR so it can be reviewed? Thanks a lot!

@huaxingao
Contributor

@jshmchenxi
Do you still have time to work on this PR? If not, is it OK with you if I or someone else submits a new PR based on your work? We will list you as co-author and you will get commit credit. Please let me know if this is OK with you. Thank you very much!

@ConeyLiu
Contributor

ConeyLiu commented Nov 2, 2021

Hi @jshmchenxi, thanks for the contribution. We have some updates based on your work. Would you mind if we submit another PR based on it?

@jshmchenxi
Contributor Author

jshmchenxi commented Nov 3, 2021

Sorry for the late reply, I was busy with something else. I've rebased the code; please help review it. I have time to complete the feature now. Thanks for reminding me! @huaxingao @ConeyLiu

@huaxingao
Contributor

Thanks for rebasing the code! @jshmchenxi

cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!

public static final String AVRO_COMPRESSION_DEFAULT = "gzip";

public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;
Contributor
@flyrain flyrain Nov 9, 2021

What's the perf impact of writing the bloom filter? Does it make sense to enable it by default if the perf impact is minor? It would be nice to include benchmarks.

Contributor Author

Hi Yufei, thanks for the review. The performance impact of writing bloom filters should be negligible, though we didn't run a benchmark. The cost of bloom filters is space: the default size for one column is 1 MB per Parquet file, so enabling bloom filters for all N columns of a table adds N MB to each file. It is more reasonable to enable bloom filters only for columns that are high-cardinality and often used in filter expressions.
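For intuition, the textbook sizing formula m = -n * ln(p) / (ln 2)^2 shows how the expected NDV drives the filter size (a sketch; parquet-mr's split-block filters follow the same asymptotics, and the 0.01 false-positive rate is an assumption for illustration):

// Approximate bloom filter size in bytes for n distinct values at a target
// false-positive probability p, using the textbook formula; illustrative only
static long approxBloomFilterBytes(long n, double p) {
  double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
  return (long) Math.ceil(bits / 8);
}
// approxBloomFilterBytes(200, 0.01)       -> ~240 bytes
// approxBloomFilterBytes(1_000_000, 0.01) -> ~1.2 MB, above the 1 MB default cap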

Contributor

Thanks for the explanation. It makes sense to disable it by default. Is there any perf test from the Parquet or Spark side we can refer to?

Contributor Author

Yes, I found that SPARK-35345 is adding a perf test for the Parquet bloom filter with Spark. And the author @huaxingao happens to be in this PR 😄

Contributor

I will update the Parquet bloom filter benchmark PR so it can be merged.

Contributor
@flyrain flyrain left a comment

The patch looks good to me overall. It'd be nice to have a benchmark to understand its perf impact. I'm OK if this is planned and will be done in another PR.

Comment on lines 74 to 67
for (Map.Entry<String, String> entry : conf.entrySet()) {
  for (ConfigHelper<?> helper : helpers) {
    // Retrieve the value through Configuration's getters rather than parsing
    // the raw string, so Configuration's exact parsing rules apply
    helper.processKey(entry.getKey());
  }
}
Contributor

Extract it to a function parseConfig(Iterable<Map.Entry<String, String>> entrySet) to avoid duplication?

Contributor Author

Good point.
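Something like this, assuming both call sites loop over the same helpers list (a sketch of the extraction, not the final code):

private void parseConfig(Iterable<Map.Entry<String, String>> entrySet) {
  for (Map.Entry<String, String> entry : entrySet) {
    for (ConfigHelper<?> helper : helpers) {
      // Only the key is passed; values are resolved through Configuration's
      // own getters so its exact parsing rules apply
      helper.processKey(entry.getKey());
    }
  }
}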

Comment on lines 201 to 202
Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
"pageStore must be an instance of BloomFilterWriteStore");
Contributor
@flyrain flyrain Nov 9, 2021

Do we need this check? It throws a ClassCastException at line 205 anyway if the type doesn't match.

Contributor Author

The code is more readable with the check here.

Member

I'm a little confused about how the type might not match here. We get the class via reflection, so is there a chance of getting the wrong class?

Seems like we may benefit from moving this into the DynConstructors code at the top of the file? I may be misunderstanding, but it seems like we should do this check in the reflection itself rather than after we instantiate.

Contributor Author

Yes, the check is redundant here. I'll remove it.

> Seems like we may benefit from moving this into the DynConstructors code at the top of the file?

We are checking the constructed instance against an interface type. Maybe it would help if we added such a constraint to DynConstructors.

| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
Contributor

Minor suggestion, how about this?

"Enables or disables writing bloom filters for all columns by default. Individual columns can be enabled or disabled by specifying the column name. For example, ..."

Contributor

BTW, how does a user specify multiple columns? like this or something else?

write.parquet.bloom-filter-enabled#column1=false
write.parquet.bloom-filter-enabled#column2=false

Contributor Author

Yes, the format follows the properties defined in parquet-hadoop, like parquet.bloom.filter.enabled and parquet.bloom.filter.expected.ndv.
However, per Ryan's comment, I will change the property pattern to write.parquet.bloom-filter.col1.enabled.

Contributor

OK, that requires changes to the ColumnConfigParser class, since the "#" isn't used anymore.

import org.apache.hadoop.conf.Configuration;

/**
* TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, this class should be replaced with it.
 */
Member

Is there any plan for actually making this public?

Member

I think I agree with Ryan's comments that we should strive to keep this as similar to the way we set up Parquet metrics as possible. I know this would actually affect the writer, while our other properties affect the metrics level, but I think it makes sense to keep all the configurations alike.

Contributor Author

Thanks for the review. I've updated the configurations to be similar to metrics.


ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));
Member

I think we should add a few negative tests as well, since all of these check for inclusion and none for exclusion. I'm not super worried about this since I assume this code path is already well tested in Parquet.

Contributor Author

The hash values might collide, so I didn't add exclusion tests.
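For reference, a negative check would look roughly like this; a false positive on the chosen value could make it flaky (999 is an arbitrary value assumed absent from the test file):

// Bloom filters guarantee no false negatives but allow false positives,
// so this assertion can flake for an unlucky value
Assert.assertFalse("Bloom filter should not report 999 as present",
    intBloomFilter.findHash(intBloomFilter.hash(999)));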

| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters. If true, bloom filters are enabled for all columns; if false, they are disabled for all columns. Individual columns can be overridden by appending `#` and the column name to the property. For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` enables bloom filters for all columns except `some_column` |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, used to compute the optimal size of the bloom filter. If this property is not set, the bloom filter uses the maximum size. If this property is set for a column, there is no need to also enable the bloom filter with the `write.parquet.bloom-filter-enabled` property. For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` enables a bloom filter for `some_column` sized for 200 expected distinct values |
Member

A few minor suggestions:
"The expected number of distinct values in a column; used to compute the optimal size in bytes of the bloom filter. Note that if this property is not set, the bloom filter will use the maximum size set in `write.parquet.bloom-filter-max-bytes`."

"This property overrides `write.parquet.bloom-filter-enabled`, automatically enabling bloom filters for any columns specified." Then the example?

Contributor Author

Got it!

Contributor

Hi everyone, does anybody know how to set these configurations when using Trino with Iceberg? I have searched many places but haven't found any useful info.

Contributor

Did you check the Trino source code? Cc @jackye1995

Contributor Author

You would have to do a similar adaptation in Trino. @kingeasternsun

Member

Trino uses completely different Parquet readers, so the properties don't all translate over. Basically, any changes we make in Iceberg's Parquet readers will not affect Trino.

@jshmchenxi jshmchenxi force-pushed the bloom-filter-write branch 2 times, most recently from 8752f59 to 63c7acf on November 21, 2021 07:51
.withMaxBloomFilterBytes(bloomFilterMaxBytes)
.withBloomFilterEnabled(bloomFilterEnabled);

new ColumnConfigParser()
Contributor

Looks a bit weird; is this intended, a new instance without any reference to it?

Contributor Author
@jshmchenxi jshmchenxi Feb 9, 2022

This ColumnConfigParser instance is created to parse the column configs and apply them to propsBuilder.

Contributor

I know it's probably not a big deal, but could creating new instances be avoided?

Contributor Author

The parser is designed to be used this way, just like we need to create builder instances to build objects.
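For reference, the fluent usage looks roughly like this (a sketch following parquet-mr's ColumnConfigParser shape; the constant names and the conf/propsBuilder variables are assumed from the surrounding writer code):

new ColumnConfigParser()
    .withColumnConfig(PARQUET_BLOOM_FILTER_ENABLED,
        key -> conf.getBoolean(key, bloomFilterEnabled),
        propsBuilder::withBloomFilterEnabled)
    .withColumnConfig(PARQUET_BLOOM_FILTER_EXPECTED_NDV,
        key -> conf.getLong(key, -1L),
        propsBuilder::withBloomFilterNDV)
    .parseConfig(conf);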

@jshmchenxi
Contributor Author

Closing in favor of #4831.

@jshmchenxi jshmchenxi closed this May 21, 2022
