Support reading Parquet row group bloom filter #4938

huaxingao · 2022-06-02T02:55:25Z

Co-Authored-By: Xi Chen [email protected]
Co-Authored-By: Hao Lin [email protected]
Co-Authored-By: Huaxin Gao [email protected]

This PR adds the read path for Parquet row group bloom filters. The original PR is here

huaxingao · 2022-06-02T18:06:28Z

I don't have a good way to test the read path on its own, but I have tested these changes together with the write path locally.

huaxingao · 2022-06-02T18:09:54Z

cc @rdblue @RussellSpitzer @kbendick @chenjunjiedada @hililiwei

rdblue · 2022-06-03T14:55:47Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java

+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;


Is it possible to include the unit test in this PR? I think all you'd need to do is configure the bloom filter settings using the Parquet settings in a Hadoop Configuration rather than through the Iceberg write settings.

I tried to configure this using ParquetOutputFormat.BLOOM_FILTER_ENABLED by replacing this line with .set(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#_id", "true"), but it doesn't work.

It seems Iceberg only honors its own properties (which are set by Context here) and doesn't pick up the properties set through Parquet.

I think those properties are passed through to the Hadoop Configuration automatically. Is that no longer true?

@rdblue Thanks for your quick reply!

It seems the properties need to be passed to InternalParquetRecordWriter through this encodingProps. We need to call the withXXX method explicitly, e.g. withDictionaryPageSize, to set the property on encodingPropsBuilder.

So it seems to me that we have to call withBloomFilterEnabled explicitly to set the bloom filter property on encodingPropsBuilder. Otherwise, Parquet's InternalParquetRecordWriter won't be able to pick it up.

Should I set the Parquet properties on encodingPropsBuilder? If the same property is also set by the Iceberg properties, I will set it again to overwrite the Parquet value.

I see. Is it possible to get some tests working without the write-side changes? Maybe write a Parquet file directly and use name mapping? If not then let's try to make the minimal write-side changes to get the test in.

I made some minimal write-side changes to get the test in. Hope this is OK.
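The exchange above comes down to one point: a key placed in a configuration has no effect on the writer unless something copies it onto the writer's property builder. The sketch below models that with plain Java collections. `WriterProps` and `BloomConfigForwarding` are hypothetical stand-ins, not Parquet or Iceberg classes, and the `parquet.bloom.filter.enabled` prefix is assumed to be the value behind `ParquetOutputFormat.BLOOM_FILTER_ENABLED`, with the `#_id` column suffix taken from the snippet quoted above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a writer property builder; not a Parquet or Iceberg class. */
class WriterProps {
  private final Set<String> bloomColumns = new TreeSet<>();

  /** Explicit setter, analogous in spirit to the withXXX calls discussed above. */
  WriterProps withBloomFilterEnabled(String column, boolean enabled) {
    if (enabled) {
      bloomColumns.add(column);
    } else {
      bloomColumns.remove(column);
    }
    return this;
  }

  Set<String> bloomColumns() {
    return bloomColumns;
  }
}

class BloomConfigForwarding {
  // Assumed property prefix, followed by the '#' separator for a column-scoped key.
  static final String PREFIX = "parquet.bloom.filter.enabled#";

  /** Copies every column-scoped bloom filter flag from the config onto the builder. */
  static WriterProps forward(Map<String, String> config, WriterProps props) {
    for (Map.Entry<String, String> entry : config.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        String column = entry.getKey().substring(PREFIX.length());
        props.withBloomFilterEnabled(column, Boolean.parseBoolean(entry.getValue()));
      }
    }
    return props;
  }
}
```

Without the call to `forward`, a fresh `WriterProps` has no bloom filter columns no matter what the config map contains, which is the behavior reported in the thread.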

rdblue · 2022-06-03T14:56:38Z

api/src/main/java/org/apache/iceberg/expressions/Binder.java

+  }
+
+  public static Set<Integer> references(
+      StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {


Can we detect that an expression is unbound? Maybe identify named refs and return? Then we could just use one method for everything.

Can I just use instanceof Unbound to detect if an expression is unbound? Something like this:

private static boolean isUnbound(Expression expr) {
  switch (expr.op()) {
    case TRUE:
      return false;
    case FALSE:
      return false;
    case NOT:
      Not not = (Not) expr;
      return isUnbound(not.child());
    case AND:
      And and = (And) expr;
      return isUnbound(and.left()) || isUnbound(and.right());
    case OR:
      Or or = (Or) expr;
      return isUnbound(or.left()) || isUnbound(or.right());
    default:
      return expr instanceof Unbound;
  }
}

Then the method boundReferences can be:

public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
  if (exprs == null) {
    return ImmutableSet.of();
  }
  ReferenceVisitor visitor = new ReferenceVisitor();
  for (Expression expr : exprs) {
    if (isUnbound(expr)) {
      ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
    } else {
      ExpressionVisitors.visit(expr, visitor);
    }
  }
  return visitor.references;
}

Not all expressions are bound or unbound. What about just trying to bind the expression and catching the exception that is thrown if it's already bound? Then you can just move on and return the references.

Sounds good to me!

rdblue · 2022-06-03T15:00:30Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java

+   * @return false if the file cannot contain rows that match the expression, true otherwise.
+   */
+  public boolean shouldRead(MessageType fileSchema, BlockMetaData rowGroup,
+      BloomFilterReader bloomReader) {


Style nit: the starting point of all argument lines should be aligned. That can be at 2 indents from the start of the method OR after the opening ( to start method arguments. There should not be two different indentation levels like you have here (both after ( and 2 indents from the method definition line).

Thanks for the reminder. I will pay attention to the style next time.
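For readers unfamiliar with the contract in the `shouldRead` Javadoc quoted above: a bloom filter can prove that a value is absent but can never prove that it is present, which is why the method returns false only when the row group cannot contain a match. The following is a toy illustration of that contract, not Parquet's bloom filter format and not the filter class in this PR. `ToyBloomFilter`, `ToyRowGroupFilter`, the sizing, and the hash mixing are all made up for the example.

```java
import java.util.BitSet;

/** Toy bloom filter for illustration only; Parquet's real format differs. */
class ToyBloomFilter {
  private final BitSet bits;
  private final int size;

  ToyBloomFilter(int size) {
    this.size = size;
    this.bits = new BitSet(size);
  }

  /** Sets two bit positions derived from the value. */
  void insert(long value) {
    bits.set(index(value, 1));
    bits.set(index(value, 2));
  }

  /** False means the value was definitely never inserted; true means it may have been. */
  boolean mightContain(long value) {
    return bits.get(index(value, 1)) && bits.get(index(value, 2));
  }

  // Arbitrary mixing constants; any reasonable hash would do for this sketch.
  private int index(long value, int seed) {
    long mixed = (value ^ (seed * 0xC2B2AE3D27D4EB4FL)) * 0x9E3779B97F4A7C15L;
    return (int) Math.floorMod(mixed >>> 17, (long) size);
  }
}

class ToyRowGroupFilter {
  /** Mirrors the quoted contract for an equality predicate: skip only on a definite miss. */
  static boolean shouldRead(ToyBloomFilter filter, long equalsValue) {
    return filter.mightContain(equalsValue);
  }
}
```

A `true` result may be a false positive, so the row group still has to be read and the predicate evaluated against the actual rows. Only a `false` result allows the reader to skip the row group.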

rdblue · 2022-06-12T18:08:09Z

api/src/main/java/org/apache/iceberg/expressions/Binder.java

+      try {
+        ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      } catch (IllegalStateException e) {
+        if (e.getMessage().contains("Found already bound predicate")) {


We shouldn't rely on exception messages like this.

I think instead we should just add a utility to detect whether an expression is bound. Here's an implementation:

/**
 * Returns whether an expression is bound.
 * <p>
 * An expression is bound if all of its predicates are bound.
 *
 * @param expr an {@link Expression}
 * @return true if the expression is bound
 * @throws IllegalArgumentException if the expression has both bound and unbound predicates.
 */
public static boolean isBound(Expression expr) {
  Boolean isBound = ExpressionVisitors.visit(expr, new IsBoundVisitor());
  return isBound != null ? isBound : false; // assume unbound if undetermined
}

private static class IsBoundVisitor extends ExpressionVisitors.ExpressionVisitor<Boolean> {
  @Override
  public Boolean not(Boolean result) {
    return result;
  }

  @Override
  public Boolean and(Boolean leftResult, Boolean rightResult) {
    return combineResults(leftResult, rightResult);
  }

  @Override
  public Boolean or(Boolean leftResult, Boolean rightResult) {
    return combineResults(leftResult, rightResult);
  }

  @Override
  public <T> Boolean predicate(BoundPredicate<T> pred) {
    return true;
  }

  @Override
  public <T> Boolean predicate(UnboundPredicate<T> pred) {
    return false;
  }

  private Boolean combineResults(Boolean isLeftBound, Boolean isRightBound) {
    if (isLeftBound != null) {
      Preconditions.checkArgument(isRightBound == null || isLeftBound.equals(isRightBound),
          "Found partially bound expression");
      return isLeftBound;
    } else {
      return isRightBound;
    }
  }
}

Sounds great! Thank you very much for the implementation!

No problem! Thanks for all your work on this.
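The three-valued logic in the `IsBoundVisitor` suggested above (bound, unbound, or undetermined as `null`) can be tried out in isolation. The sketch below reproduces it on a deliberately tiny expression model. `MiniExpr` and `MiniBoundCheck` are hypothetical stand-ins rather than Iceberg's `Expression` API, and the sketch assumes that constant expressions carry no binding information, which is how the `null` case reads in the snippet above.

```java
/** Simplified stand-in for an expression tree; not Iceberg's Expression API. */
class MiniExpr {
  enum Kind { CONSTANT, BOUND, UNBOUND, AND, OR, NOT }

  final Kind kind;
  final MiniExpr left;   // child for NOT, left operand for AND/OR
  final MiniExpr right;  // right operand for AND/OR, otherwise null

  private MiniExpr(Kind kind, MiniExpr left, MiniExpr right) {
    this.kind = kind;
    this.left = left;
    this.right = right;
  }

  static MiniExpr constant() { return new MiniExpr(Kind.CONSTANT, null, null); }
  static MiniExpr bound() { return new MiniExpr(Kind.BOUND, null, null); }
  static MiniExpr unbound() { return new MiniExpr(Kind.UNBOUND, null, null); }
  static MiniExpr and(MiniExpr l, MiniExpr r) { return new MiniExpr(Kind.AND, l, r); }
  static MiniExpr or(MiniExpr l, MiniExpr r) { return new MiniExpr(Kind.OR, l, r); }
  static MiniExpr not(MiniExpr child) { return new MiniExpr(Kind.NOT, child, null); }
}

class MiniBoundCheck {
  /** Returns true only if predicates were found and all of them are bound. */
  static boolean isBound(MiniExpr expr) {
    Boolean result = visit(expr);
    return result != null ? result : false; // assume unbound if undetermined
  }

  // A null result means the subtree contains no predicates, so nothing is known yet.
  private static Boolean visit(MiniExpr expr) {
    switch (expr.kind) {
      case CONSTANT:
        return null;
      case BOUND:
        return Boolean.TRUE;
      case UNBOUND:
        return Boolean.FALSE;
      case NOT:
        return visit(expr.left);
      default: // AND and OR combine both sides the same way
        return combine(visit(expr.left), visit(expr.right));
    }
  }

  private static Boolean combine(Boolean left, Boolean right) {
    if (left != null) {
      if (right != null && !left.equals(right)) {
        throw new IllegalArgumentException("Found partially bound expression");
      }
      return left;
    }
    return right;
  }
}
```

In this model, `and(constant(), bound())` is reported as bound because the constant side contributes nothing, while `and(bound(), unbound())` is rejected with an `IllegalArgumentException` instead of being silently classified one way or the other. The caller gets a typed signal and never needs to inspect an exception message.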

rdblue · 2022-06-12T18:10:35Z

parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java

-            .build(),
-            metricsConfig);
+            .withDictionaryPageSize(dictionaryPageSize);
+        // Todo: The following code needs to be improved in the bloom filter write path PR.


I think this is okay for now to get this in with tests. Thanks, @huaxingao!

rdblue · 2022-06-13T02:12:19Z

Thanks, @huaxingao!

huaxingao · 2022-06-13T03:26:28Z

Thank you very much @rdblue! Also thank you @RussellSpitzer @kbendick @chenjunjiedada @hililiwei @stevenzwu

kbendick · 2022-06-13T03:28:19Z

Thank you for making support for reading Parquet bloom filters happen @huaxingao! Your work is highly appreciated. This will be very valuable for a number of use cases.

github-actions bot added API parquet labels Jun 2, 2022

Support reading Parquet row group bloom filter

4ac3d76

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]> Co-authored-by: Hao Lin <[email protected]>

huaxingao force-pushed the bf_read branch from 727210c to 4ac3d76 Compare June 2, 2022 04:27

huaxingao added 3 commits June 1, 2022 21:33

Trigger Build

e37af04

Trigger Build

27f2ad8

Trigger Build

b03e636

rdblue reviewed Jun 3, 2022

address comments

34bb99a

github-actions bot added the core label Jun 9, 2022

rdblue reviewed Jun 12, 2022

add IsBoundVisitor to check if an expression is bound

ce0a26f

rdblue approved these changes Jun 12, 2022

rdblue merged commit 87242c0 into apache:master Jun 13, 2022

huaxingao deleted the bf_read branch June 13, 2022 03:26

kbendick mentioned this pull request Jul 6, 2022

Integrate Parquet bloomfilter feature #2391

Closed

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Parquet: Support row group bloom filters (apache#4938)

977453a

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Parquet: Support row group bloom filters (apache#4938)

c72a15a

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>

Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 18, 2022

Parquet: Support row group bloom filters (apache#4938)

ebac219

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>

Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 18, 2022

Parquet: Support row group bloom filters (apache#4938)

1fea542

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>

Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 20, 2022

Parquet: Support row group bloom filters (apache#4938)

370500e

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>

Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 25, 2022

Parquet: Support row group bloom filters (apache#4938)

d7feb12

Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
