
Conversation

@jshmchenxi
Contributor

@jshmchenxi jshmchenxi commented May 11, 2021

For #2391, add Parquet BloomFilter support to Iceberg.
Upgrade the Parquet version to 1.12.0 and add ParquetBloomRowGroupFilter, similar to ParquetDictionaryRowGroupFilter.

The ExpressionVisitor is implemented with reference to org.apache.parquet.filter2.bloomfilterlevel.BloomFilterImpl.
A bloom filter only helps with eq() and in() expressions; it cannot filter rows for other expressions such as gt() or notEq().

Add three new properties to TableProperties, defined similarly to those in apache/parquet-mr (see the sketch after the list):

  • write.parquet.bloom-filter-enabled
  • write.parquet.bloom-filter-max-bytes
  • write.parquet.bloom-filter-expected-ndv
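For illustration, a minimal sketch of setting these properties on an existing table through Iceberg's UpdateProperties API; the property names are the ones introduced by this PR, and the values shown are arbitrary:

```java
import org.apache.iceberg.Table;

public class EnableBloomFilters {
  // Sketch only: enable Parquet bloom filter writing on an Iceberg table
  // using the properties added in this PR. The values are illustrative.
  static void enable(Table table) {
    table.updateProperties()
        .set("write.parquet.bloom-filter-enabled", "true")         // turn bloom filter writing on
        .set("write.parquet.bloom-filter-max-bytes", "1048576")    // cap each filter at 1 MB
        .set("write.parquet.bloom-filter-expected-ndv", "100000")  // expected distinct values
        .commit();
  }
}
```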

@jshmchenxi
Contributor Author

@chenjunjiedada Hi, would you please help review this patch? Thanks!

optional(12, "_all_nans", DoubleType.get()),
optional(13, "_some_nans", FloatType.get()),
optional(14, "_no_nans", DoubleType.get()),
optional(15, "_struct_not_null", _structFieldType),
Contributor

Can you please add a test for this column using an in or equals predicate, to ensure that a query against a field the bloom filter can't be used for doesn't throw when the file does have a bloom filter?
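A rough sketch of such a test, assuming the ParquetBloomRowGroupFilter added in this PR mirrors ParquetDictionaryRowGroupFilter's constructor and shouldRead signature; SCHEMA, parquetSchema, rowGroup, and bloomStore are placeholders for the test class's existing scaffolding:

```java
// Hypothetical test sketch; adjust names to the actual API in this PR.
@Test
public void testEqOnColumnWithoutUsableBloomFilter() {
  // _all_nans is a double column the bloom filter can't be used for;
  // the filter should fall back to "rows might match" instead of throwing.
  boolean shouldRead = new ParquetBloomRowGroupFilter(SCHEMA, Expressions.equal("_all_nans", 1.0D))
      .shouldRead(parquetSchema, rowGroup, bloomStore);
  Assert.assertTrue("Should read: bloom filter cannot be used for this column", shouldRead);
}
```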

Contributor Author

OK. I'll try later. Thanks for the review!

Member

I think we will need fully covered tests that address all the data types from Type.java; the selected Integer, Double, String, and Float are not enough.

@rdblue
Contributor

rdblue commented May 20, 2021

@jshmchenxi, I think this should be done in several PRs instead of one. First, we would need to update the Parquet version, then we would want to add read support and finally we would add write support. That will help keep the changes to a size where reviewers can get through them in a reasonable amount of time.

I also think that we need to consider more carefully how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?

@kbendick
Contributor

@jshmchenxi, I think this should be done in several PRs instead of one. First, we would need to update the Parquet version, then we would want to add read support and finally we would add write support. That will help keep the changes to a size where reviewers can get through them in a reasonable amount of time

Agreed on Parquet versions. With the number of supported Spark versions, it would be difficult to bring in Parquet 1.12 (as great as it is) without some consideration from major stakeholders.

@jshmchenxi
Contributor Author

Thanks for the suggestion, I'll split this into several PRs. @rdblue @kbendick

| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filters; it can also be enabled per column by appending # and the column name to this property key |
Contributor Author

I also think that we need to consider more carefully how to configure Parquet's bloom filters. I would expect what you've added here as table properties to be column-specific. Why did you choose global settings? Does this create a bloom filter with the same NDV for all columns?

@rdblue Yes, write.parquet.bloom-filter-enabled and write.parquet.bloom-filter-expected-ndv both support column-specific settings. We can set write.parquet.bloom-filter-enabled#user_id=true and write.parquet.bloom-filter-expected-ndv#user_id=1000 to enable the bloom filter only for the column user_id, with an expected NDV of 1000.
I'll make the docs more complete in the new PRs.
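For example (a sketch, assuming `table` is an org.apache.iceberg.Table handle; this '#'-delimited form is what this revision of the PR implements):

```java
// Enable the bloom filter only for the user_id column, with an expected NDV of 1000.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled#user_id", "true")
    .set("write.parquet.bloom-filter-expected-ndv#user_id", "1000")
    .commit();
```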

* TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
* Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
*/
class ColumnConfigParser {
Contributor

Iceberg doesn't use the same names that Parquet would, and it also doesn't use a Configuration to store properties. We need to think about what would make sense for Iceberg here, and using # to delimit properties is probably too confusing.

I think that the properties proposed in this PR for global defaults make sense, like write.parquet.bloom-filter-enabled, although the NDV default is probably not useful given that we expect NDV to vary widely across fields. For the column-specific settings, I think we may want to follow the same pattern that is used by metrics collection. That embeds the column name in the property, like write.metadata.metrics.column.col1. This could be write.parquet.bloom-filter.col1.enabled or write.parquet.bloom-filter.col1.max-bytes.
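As a sketch of that proposed pattern (these column-embedded keys are a suggestion in this discussion, not properties that existed at the time; col1 is a placeholder column name):

```java
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled", "true")           // global default
    .set("write.parquet.bloom-filter.col1.enabled", "true")      // per-column override
    .set("write.parquet.bloom-filter.col1.max-bytes", "1048576") // per-column size cap
    .commit();
```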

Contributor Author

Ok, I will try to change the configuration pattern when I have time.

this.recordCount = 0;

-    PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
+    ColumnChunkPageWriteStore pageStore = pageStoreCtorParquet.newInstance(
Contributor

Why are there write-side changes in this PR?

Contributor Author

Hi, I have split this into 2 PRs:
Core: Support writing parquet bloom filter #2642
Core: Support reading parquet bloom filter #2643

.build()) {
GenericRecordBuilder builder = new GenericRecordBuilder(convert(FILE_SCHEMA, "table"));
// create 50 records
for (int i = 0; i < INT_VALUE_COUNT; i += 1) {
Member

We usually use org.apache.iceberg.data.RandomGenericData#generate to generate random Records for testing because it covers almost all the corner cases that will be encountered in real production (I actually detected several bugs when using RandomGenericData to mock data in unit tests). I think we could also use it here. For example, we could generate several records into a collection and then check that all the values from the given column test positive against the Parquet bloom filter binary.
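A minimal sketch of that suggestion, assuming FILE_SCHEMA is the schema already defined in the test class and that the write and read-back steps reuse the PR's existing scaffolding:

```java
// Generate 100 random records with a fixed seed so the test is reproducible.
List<Record> records = RandomGenericData.generate(FILE_SCHEMA, 100, 42L);
// 1. Write `records` to a Parquet file with the bloom filter properties enabled.
// 2. Read back the bloom filter for the column under test.
// 3. Assert that every generated value for that column tests positive, e.g. via
//    bloomFilter.findHash(bloomFilter.hash(value)).
```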

Contributor Author

@openinx Thanks! That's a good idea. I'll try using the RandomGenericData utility to generate test cases.

@jshmchenxi
Contributor Author

Closing in favor of #4831

@jshmchenxi jshmchenxi closed this May 21, 2022