
Conversation

@jshmchenxi
Contributor

@jshmchenxi jshmchenxi commented May 11, 2021

For #2391, add Parquet BloomFilter support to Iceberg.
Upgrade the Parquet version to 1.12.0 and add ParquetBloomRowGroupFilter, similar to ParquetDictionaryRowGroupFilter.

The ExpressionVisitor is implemented with reference to org.apache.parquet.filter2.bloomfilterlevel.BloomFilterImpl.
A bloom filter only helps with eq() and in() expressions; it cannot filter rows for other expressions such as gt() or notEq().

Add three new properties to TableProperties, defined similarly to those in apache/parquet-mr (see the sketch after the list):

  • write.parquet.bloom-filter-enabled
  • write.parquet.bloom-filter-max-bytes
  • write.parquet.bloom-filter-expected-ndv
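For illustration, a minimal sketch of setting these properties on an existing table through Iceberg's UpdateProperties API; the property names are the ones introduced by this PR, and the values shown are arbitrary:

```java
import org.apache.iceberg.Table;

public class EnableBloomFilters {
  // Sketch only: enable Parquet bloom filter writing on an Iceberg table
  // using the properties added in this PR. The values are illustrative.
  static void enable(Table table) {
    table.updateProperties()
        .set("write.parquet.bloom-filter-enabled", "true")         // turn bloom filter writing on
        .set("write.parquet.bloom-filter-max-bytes", "1048576")    // cap each filter at 1 MB
        .set("write.parquet.bloom-filter-expected-ndv", "100000")  // expected distinct values
        .commit();
  }
}
```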

@jshmchenxi
Contributor Author

@chenjunjiedada Hi, would you please help review this patch? Thanks!

optional(12, "_all_nans", DoubleType.get()),
optional(13, "_some_nans", FloatType.get()),
optional(14, "_no_nans", DoubleType.get()),
optional(15, "_struct_not_null", _structFieldType),
Contributor

Can you please add a test for this column using an in or equals predicate, to ensure that a query against a field the bloom filter can't be used for doesn't throw when the file does have a bloom filter?
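A rough sketch of such a test, assuming the ParquetBloomRowGroupFilter added in this PR mirrors ParquetDictionaryRowGroupFilter's constructor and shouldRead signature; SCHEMA, parquetSchema, rowGroup, and bloomStore are placeholders for the test class's existing scaffolding:

```java
// Hypothetical test sketch; adjust names to the actual API in this PR.
@Test
public void testEqOnColumnWithoutUsableBloomFilter() {
  // _all_nans is a double column the bloom filter can't be used for;
  // the filter should fall back to "rows might match" instead of throwing.
  boolean shouldRead = new ParquetBloomRowGroupFilter(SCHEMA, Expressions.equal("_all_nans", 1.0D))
      .shouldRead(parquetSchema, rowGroup, bloomStore);
  Assert.assertTrue("Should read: bloom filter cannot be used for this column", shouldRead);
}
```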

Contributor Author

OK. I'll try later. Thanks for the review!

Member

I think we will need fully covered tests that address all the data types from Type.java; the selected Integer, Double, String, and Float are not enough.

@rdblue
Contributor

rdblue commented May 20, 2021

@jshmchenxi, I think this should be done in several PRs instead of one. First, we would need to update the Parquet version, then we would want to add read support and finally we would add write support. That will help keep the changes to a size where reviewers can get through them in a reasonable amount of time.

I also think that we need to consider more carefully how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?

@kbendick
Contributor

@jshmchenxi, I think this should be done in several PRs instead of one. First, we would need to update the Parquet version, then we would want to add read support and finally we would add write support. That will help keep the changes to a size where reviewers can get through them in a reasonable amount of time

Agreed on Parquet versions. With the number of supported Spark versions, it would be difficult to bring in Parquet 1.12 (as great as it is) without some consideration from major stakeholders.

@jshmchenxi
Contributor Author

Thanks for the suggestion, I'll split this into several PRs. @rdblue @kbendick

| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters; it can also be enabled per column by appending # and the column name to this property key |
Contributor Author

I also think that we need to consider more carefully how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?

@rdblue Yes, write.parquet.bloom-filter-enabled and write.parquet.bloom-filter-expected-ndv both support column-specific settings. We can set write.parquet.bloom-filter-enabled#user_id=true and write.parquet.bloom-filter-expected-ndv#user_id=1000 to enable the bloom filter only for the column user_id, with an expected NDV of 1000.
I'll make the docs more complete in the new PRs.
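For example (a sketch, assuming `table` is an org.apache.iceberg.Table handle; this '#'-delimited form is what this revision of the PR implements):

```java
// Enable the bloom filter only for the user_id column, with an expected NDV of 1000.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled#user_id", "true")
    .set("write.parquet.bloom-filter-expected-ndv#user_id", "1000")
    .commit();
```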

* TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
* Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
*/
class ColumnConfigParser {
Contributor

Iceberg doesn't use the same names that Parquet would, and it also doesn't use a Configuration to store properties. We need to think about what would make sense for Iceberg here, and using # to delimit properties is probably too confusing.

I think that the properties proposed in this PR for global defaults make sense, like write.parquet.bloom-filter-enabled, although the NDV default is probably not useful given that we expect NDV to vary widely across fields. For the column-specific settings, I think we may want to follow the same pattern that is used by metrics collection. That embeds the column name in the property, like write.metadata.metrics.column.col1. This could be write.parquet.bloom-filter.col1.enabled or write.parquet.bloom-filter.col1.max-bytes.
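As a sketch of that proposed pattern (these column-embedded keys are a suggestion in this discussion, not properties that existed at the time; col1 is a placeholder column name):

```java
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled", "true")           // global default
    .set("write.parquet.bloom-filter.col1.enabled", "true")      // per-column override
    .set("write.parquet.bloom-filter.col1.max-bytes", "1048576") // per-column size cap
    .commit();
```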

Contributor Author

Ok, I will try to change the configuration pattern when I have time.

this.recordCount = 0;

-    PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
+    ColumnChunkPageWriteStore pageStore = pageStoreCtorParquet.newInstance(
Contributor

Why are there write-side changes in this PR?

Contributor Author

Hi, I have split this into 2 PRs:
Core: Support writing parquet bloom filter #2642
Core: Support reading parquet bloom filter #2643

.build()) {
GenericRecordBuilder builder = new GenericRecordBuilder(convert(FILE_SCHEMA, "table"));
// create 50 records
for (int i = 0; i < INT_VALUE_COUNT; i += 1) {
Member

We usually use org.apache.iceberg.data.RandomGenericData#generate to generate random Records for testing because it covers almost all the corner cases that will be encountered in real production (I actually detected several bugs when using RandomGenericData to mock data in unit tests). I think we could also use it here. For example, we could generate several records into a collection and then check that all the values from the given column test positive against the Parquet bloom filter binary.
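A minimal sketch of that suggestion, assuming FILE_SCHEMA is the schema already defined in the test class and that the write and read-back steps reuse the PR's existing scaffolding:

```java
// Generate 100 random records with a fixed seed so the test is reproducible.
List<Record> records = RandomGenericData.generate(FILE_SCHEMA, 100, 42L);
// 1. Write `records` to a Parquet file with the bloom filter properties enabled.
// 2. Read back the bloom filter for the column under test.
// 3. Assert that every generated value for that column tests positive, e.g. via
//    bloomFilter.findHash(bloomFilter.hash(value)).
```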

Contributor Author

@openinx Thanks! That's a good idea. I'll try using the RandomGenericData utility to generate test cases.

@jshmchenxi
Contributor Author

Closing in favor of #4831

@jshmchenxi jshmchenxi closed this May 21, 2022