-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Add support for struct field based filtering #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for struct field based filtering #123
Conversation
api/src/main/java/com/netflix/iceberg/expressions/UnboundPredicate.java
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
| } | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method be returning a null value in case none of the nested fields of this struct match by field id?
Instead of an instance of BoundReference with a null private attribute type would it be acceptable that we instead throw new ValidationException("Cannot find nested field id %d in struct: %s", fieldId, struct); ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
api/src/main/java/com/netflix/iceberg/expressions/BoundReference.java
Outdated
Show resolved
Hide resolved
|
Thank you @prodeezy for contributing this, it looks really helpful! |
|
General API design related question - how do we bubble up in the API layer the fact that we are supporting the struct field based filtering but not other nested field types filtering? I am not referring to the implementation details of exposing this in the API, I am just curious what a client's expectations should be that the API exposes this new capacity in a coherent manner. |
…ding nested field in UnboundPredicate.bind
@fbocse Currently it would be implied. The only feedback the client gets right now is in the Physical plan, that this filter is pushed down to Iceberg Scan level and if the client inspects the Scan (using |
api/src/main/java/com/netflix/iceberg/expressions/BoundReference.java
Outdated
Show resolved
Hide resolved
| boolean isNestedFieldExp = expressionFieldPath.indexOf('.') > -1; | ||
|
|
||
| field = isNestedFieldExp ? findNestedField(struct, expressionFieldPath, caseSensitive) : | ||
| caseSensitive ? struct.field(ref().name()) : struct.caseInsensitiveField(ref().name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can we reuse expressionFieldPath instead of obtaining ref().name() again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| this.fieldId = fieldId; | ||
| this.pos = find(fieldId, struct); | ||
| this.type = struct.fields().get(pos).type(); | ||
| this.pos = findTopFieldPos(fieldId, struct); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to have a pos = -1?
I ask because although this field is private, we do expose it via pos() and toString() methods, so we may need to populate this field with the actual position of the matched inner struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. going to write some unit tests to evaluate impact. will handle accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handled this using accessors
api/src/main/java/com/netflix/iceberg/expressions/BoundReference.java
Outdated
Show resolved
Hide resolved
|
|
||
| return subField.field(lastFieldInPath); | ||
|
|
||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we need to recurse, I wonder if it makes sense to reuse the indexes available in TypeUtils: https://github.com/apache/incubator-iceberg/blob/0c9c63140e838875dc8cc52a57be2c8f24ad9975/api/src/main/java/com/netflix/iceberg/types/TypeUtil.java#L78-L84 so that this code simplifies to:
Integer idx = indexByName.get(expressionFieldPath);
Types.NestedField field = indexById.get(idx);That code doesn't check for non-struct parents; we'd have to perhaps create a custom Indexer. Also we would need to calculate the index lazily and perhaps cache it ( As in the Schema, see https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/Schema.java#L59-L71 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like TypeUtils.indexByName & TypeUtils.indexById are used when reading data. It's used by the Manifest reader to index on manifest file statistics data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you are suggesting we now start using this. i dunno what impact that would have though. do you see a major benefit to it? I can look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you are suggesting we now start using this.
Right.
do you see a major benefit to it?
Code reuse. But it would only make sense if findNestedField() is called many times. The first call to findNestedField() would build the indexes and consult them. Then any subsequent call to findNestedField() would just consult the cached Maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. seems like it's being called many times but by different expression evaluators. A cache to lookup field name to Types.NestedField would help. Good call. will add one in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xabriel I'm now creating an index that keeps ids to accessors and then use accessors to reach type and fields. access to this index is lazy as well.
|
Added struct field based unit testing to TestDictionaryRowGroupFilter & TestMetricsRowGroupFilter. |
|
@rdblue thoughts? This PR should be ready for full review. I'v added tests and comments where possible. Addressed most PR comments. |
|
@prodeezy, thanks for working on this! I'll review it as soon as I get time this week. |
|
@aokolnychyi would be good to get your thoughts on this since you did all the upstream work to enable this :-) |
|
Looks like pull#138 introduced conflicts. Will rebase with latest on master and push. |
done. |
|
@prodeezy, this is looking good. I think you have the right approach, but some things are missing. With this patch,
This should also bind differently. We never parse identifiers in Iceberg. Instead of parsing, we build indexes using the possible field names. That way, we never have to worry about quoting. For example, if a user passes in an expression for Expression binding should use an index like the ones used by |
…x using BuildPositionAccessors
| field = isNestedFieldExp ? findNestedField(struct, expressionFieldPath, caseSensitive) : | ||
| caseSensitive ? struct.field(expressionFieldPath) : struct.caseInsensitiveField(ref().name()); | ||
| Schema schema = new Schema(struct.fields()); | ||
| Types.NestedField field = schema.findField(caseSensitive? ref().name(): ref().name().toLowerCase()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use a case insensitive version of findField instead of assuming that passing lower-case in will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
|
@rdblue Is there a way to test the Evaluator as an end to end test? I haven't been able to test that part out properly with my conventional gist examples. |
|
@prodeezy, |
|
Added tests for Evaluator and addressed other review comments. |
|
@rdblue Thanks for taking a detailed look! I think i'v addressed the review comments and added pending tests. |
api/src/main/java/org/apache/iceberg/expressions/BoundReference.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| public int pos() { | ||
| if (pos == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to me to throw a ValidationException in some cases. It is likely that code paths will not know whether the bound expression is for a or a.b, but the latter will fail here, during evaluation instead of during binding.
Could the evaluator that uses pos be updated so that this accessor could be used? Maybe PositionAccessor could support accessing summaries from List. That would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need some clarification here .. The evaluator in question is the InclusiveManifestEvaluator which evaluates partition fields for matching manifests. So the ref.pos() is going to be on partition fields. Afaik partition stats are kept separately in snapshot files and regular fields wont show up in this partition summary list. Is this an issue when filtering on a nested field in the schema? If so, Is the partition source id used always to reference that field in schema? Can I assume this for logical partitions as well?
e.g. This is a partition summary in snapshot which is used in the said evaluator. This is separate from the stats kept on data schema fields.
"partitions": {
"array": [
{
"contains_null": true,
"lower_bound": {
"bytes": "\u0013\u0000\u0000\u0000"
},
"upper_bound": {
"bytes": "\u001e\u0000\u0000\u0000"
}
}
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're saying that this is technically safe, and that's correct. The only code path that calls this that evaluator and it is always binding to a flat partition structure. That's why it can use the position: it knows that the array of partition summaries is in the same order as a tuple of partition values.
My point here is that it is brittle to bind to a struct type and use the position for something else, and also that it is a bad API to expose the position when no normal path uses the position directly. Instead, maybe that evaluator should get this position from the first accessor. That way, it validates that the partition field is not nested (should be a single position accessor). My original thought was to add a method to the accessor that can return one of the partition summaries from a list. That would work, too, but requires another accessor method so it isn't a great idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. So i'm now using PositionAccessor to fetch position during manifest evaluation. Throwing an error if it's a different kind of accessor. Took out the pos() api from BoundReference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. This also cleared out usage of schema in BoundReference
|
@prodeezy, are you still working on this? I'd like to get it in before we release. |
|
@rdblue sorry bout the delay. I'l have the pending comments addressed in a day or two. |
|
No problem! Just let me know if you'd like me to help out. |
|
@rdblue addressed all but one comments. Need some clarification/guidance on evaluating position for the bound exp reference. |
|
I took another glance. Think i'v addressed all pending comments. @rdblue |
|
@rdblue gentle reminder. Lemme know if this looks ok to you. |
|
Thanks for the reminder. I should have time to review it this week. Sorry for the delay. |
api/src/main/java/org/apache/iceberg/expressions/BoundReference.java
Outdated
Show resolved
Hide resolved
|
@prodeezy, I thought that it would be easier if I made a few minor changes since it would take longer to ask you to do them than to just move a few things around. I opened a PR against your branch: prodeezy#1 If you agree with those changes, just merge the PR and push and I'll commit this. Thanks! |
|
@prodeezy, the test failures are because I removed the map column from |
|
Fixed test. |
|
Merged! Thanks for fixing this @prodeezy! |
Addresses Issue#122
Update:
This PR should be ready for full review. I'v added tests and comments where possible. Addressed most PR comments.
A basic test run with this change run on pre-generated Iceberg formatted data containing struct level metrics (Run with Spark patched to push struct filters down to Iceberg) :
Gist to create data : https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
/cc @rdblue @aokolnychyi @xabriel @fbocse