-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Support reading Parquet row group bloom filter #4938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]> Co-authored-by: Hao Lin <[email protected]>
|
I don't have a good way to test the read path, but I have tested these changes with the write path on my local. |
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.iceberg.parquet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to include the unit test in this PR? I think all you'd need to do is configure the bloom filter settings using the Parquet settings in a Hadoop Configuration rather than through the Iceberg write settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to config using ParquetOutputFormat.BLOOM_FILTER_ENABLED by replacing this line with .set(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#_id", "true") , but it doesn't work.
Seems Iceberg only honors the Iceberg's properties (which are set by Context at here), but it doesn't really take the properties set by Parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that properties will be passed through to the Hadoop Configuration automatically. Is that no longer true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue Thanks for your quick reply!
Seems the properties need to be passed to InternalParquetRecordWriter through this encodingProps. We need to call the withXXX method explicitly e.g. withDictionaryPageSize to set the property to encodingPropsBuilder
So it seems to me that we have to call this withBloomFilterEnabled explicitly to set the bloom filter property toencodingPropsBuilder. Otherwise, Parquet's InternalParquetRecordWriter won't be able to take it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I set the Parquet properties to encodingPropsBuilder? If the same properties is also set by iceberg properties, I will reset to overwrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Is it possible to get some tests working without the write-side changes? Maybe write a Parquet file directly and use name mapping? If not then let's try to make the minimal write-side changes to get the test in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some minimal write-side changes to get the test in. Hope this is OK.
| } | ||
|
|
||
| public static Set<Integer> references( | ||
| StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we detect that an expression is unbound? Maybe identify named refs and return? Then we could just use one method for everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can i just use instanceof Unbound to detect if an expression is unbound? Something like this:
private static boolean isUnbound(Expression expr) {
switch (expr.op()) {
case TRUE:
return false;
case FALSE:
return false;
case NOT:
Not not = (Not) expr;
return isUnbound(not.child());
case AND:
And and = (And) expr;
return isUnbound(and.left()) || isUnbound(and.right());
case OR:
Or or = (Or) expr;
return isUnbound(or.left()) || isUnbound(or.right());
default:
return expr instanceof Unbound;
}
}
Then the method boundReferences can be
public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
if (exprs == null) {
return ImmutableSet.of();
}
ReferenceVisitor visitor = new ReferenceVisitor();
for (Expression expr : exprs) {
if (isUnbound(expr)) {
ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
} else {
ExpressionVisitors.visit(expr, visitor);
}
}
return visitor.references;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all expressions are bound or unbound. What about just trying to bind the expression and catching the exception that is thrown if it's already bound? Then you can just move on and return the references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me!
| * @return false if the file cannot contain rows that match the expression, true otherwise. | ||
| */ | ||
| public boolean shouldRead(MessageType fileSchema, BlockMetaData rowGroup, | ||
| BloomFilterReader bloomReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: the starting point of all argument lines should be aligned. That can be at 2 indents from the start of the method OR after the opening ( to start method arguments. There should not be two different indentation levels like you have here (both after ( and 2 indents from the method definition line).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reminder. I will pay attention to the style next time.
| try { | ||
| ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor); | ||
| } catch (IllegalStateException e) { | ||
| if (e.getMessage().contains("Found already bound predicate")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't rely on exception messages like this.
I think instead we should just add a utility to detect whether an expression is bound. Here's an implementation:
/**
* Returns whether an expression is bound.
* <p>
* An expression is bound if all of its predicates are bound.
*
* @param expr an {@link Expression}
* @return true if the expression is bound
* @throws IllegalArgumentException if the expression has both bound and unbound predicates.
*/
public static boolean isBound(Expression expr) {
Boolean isBound = ExpressionVisitors.visit(expr, new IsBoundVisitor());
return isBound != null ? isBound : false; // assume unbound if undetermined
}
private static class IsBoundVisitor extends ExpressionVisitors.ExpressionVisitor<Boolean> {
@Override
public Boolean not(Boolean result) {
return result;
}
@Override
public Boolean and(Boolean leftResult, Boolean rightResult) {
return combineResults(leftResult, rightResult);
}
@Override
public Boolean or(Boolean leftResult, Boolean rightResult) {
return combineResults(leftResult, rightResult);
}
@Override
public <T> Boolean predicate(BoundPredicate<T> pred) {
return true;
}
@Override
public <T> Boolean predicate(UnboundPredicate<T> pred) {
return false;
}
private Boolean combineResults(Boolean isLeftBound, Boolean isRightBound) {
if (isLeftBound != null) {
Preconditions.checkArgument(isRightBound == null || isLeftBound.equals(isRightBound),
"Found partially bound expression");
return isLeftBound;
} else {
return isRightBound;
}
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great! Thank you very much for the implementation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Thanks for all your work on this.
| .build(), | ||
| metricsConfig); | ||
| .withDictionaryPageSize(dictionaryPageSize); | ||
| // Todo: The following code needs to be improved in the bloom filter write path PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay for now to get this in with tests. Thanks, @huaxingao!
|
Thanks, @huaxingao! |
|
Thank you very much @rdblue! Also thank you @RussellSpitzer @kbendick @chenjunjiedada @hililiwei @stevenzwu |
|
Thank you for making support for reading Parquet bloom filters happen @huaxingao! Your work is highly appreciated. This will be very valuable for a number of use cases. |
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-authored-by: Xi Chen <[email protected]> Co-authored-by: Hao Lin <[email protected]>
Co-Authored-By: Xi Chen [email protected]
Co-Authored-By: Hao Lin [email protected]
Co-Authored-By: Huaxin Gao [email protected]
This is the read path of parquet row group bloom filter. The original PR is here