API: Move variant to API and add extract expression #12304
api/src/main/java/org/apache/iceberg/expressions/BoundExtract.java
aihuaxu left a comment
Minor comments. LGTM.
  switch (op()) {
    case IS_NULL:
-     if (boundTerm.ref().field().isRequired()) {
+     if (!boundTerm.producesNull()) {
I see this change is the same as the one in BoundReference (field.isOptional()). For BoundTransform, it seems we added !transform.preservesOrder(). Which transforms produce nulls when they are not order-preserving?
// transforms must produce null for null input values
// transforms may produce null for non-null inputs when not order-preserving
return ref.producesNull() || !transform.preservesOrder();
A good example is VoidTransform, which maps all values to null. If a transform preserves order, then it can't produce a null value for a non-null input because that would violate order preservation. If it does not preserve order, then it could map values to null, so we account for that. If a field is required there are no null values, so the only way a null could be produced is by the transform.
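The rule discussed above can be sketched in a few lines of plain Java. This is a minimal illustration only: the types Term, Ref, and Transformed and the method bindIsNull are hypothetical stand-ins, not the Iceberg classes, and the string results stand in for the expressions that binding would return.

```java
public class ProducesNullSketch {
  // Hypothetical stand-in for a bound term; not the Iceberg BoundTerm interface.
  interface Term {
    boolean producesNull();
  }

  // A plain reference can produce null only when the field is optional.
  static final class Ref implements Term {
    private final boolean optional;

    Ref(boolean optional) {
      this.optional = optional;
    }

    @Override
    public boolean producesNull() {
      return optional;
    }
  }

  // A transformed reference produces null for null input, and may produce null
  // for non-null input when the transform does not preserve order.
  static final class Transformed implements Term {
    private final Ref ref;
    private final boolean preservesOrder;

    Transformed(Ref ref, boolean preservesOrder) {
      this.ref = ref;
      this.preservesOrder = preservesOrder;
    }

    @Override
    public boolean producesNull() {
      return ref.producesNull() || !preservesOrder;
    }
  }

  // Assumption: isNull on a term that can never produce null folds to alwaysFalse.
  static String bindIsNull(Term term) {
    return term.producesNull() ? "isNull(term)" : "alwaysFalse";
  }
}
```

Under these assumptions, a void-like transform (not order-preserving) over a required field still reports producesNull() as true, so isNull on it is kept rather than folded to alwaysFalse.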
 * <li><code>.name</code> accesses a field by name
 * </ul>
 *
 * <p>If the query result is a list, the value is a variant array of results.
Why do we only explain for array here? The result can be object / array or primitive.
Array is called out because some JSON path queries produce a list of results. In that case, it is necessary to wrap the results in a variant array to return them.
I think there's potential for ambiguity here since the extract could produce a list of matched elements or a list of lists, which would be difficult to distinguish. I believe if we want to be compliant with RFC 9535 we would always return a list.
There are ways to work around this, for example Trino supports syntax for "conditional wrapper" or "unconditional wrapper". They also support separate access methods for query (multiple results) vs value (explicit single result).
I'll remove this for now and move the JSON path part so that it isn't exposed anywhere. That way we can change it all later.
Done. This is now removed and PathUtil is package-private.
  private static final Splitter DOT = Splitter.on(".");
  private static final String ROOT = "$";

  public static List<String> parsePath(String path) {
I don't like that we're using a List<String> to handle this, as I feel it's going to get much more complicated over time. Why not put this in a VariantPath wrapper? I think it would be better not to leak this through a public interface.
In fact, I think this whole class should just be redefined as VariantPath since that's all it's really doing at this point, and then we can build on this going forward.
I'll move this and make it package-private so we don't need to worry about changing it later.
This is now package-private so we can modify it later, but my follow-up PR should still work.
This reverts commit 671b1be.
Thanks for the reviews, @danielcweeks and @aihuaxu!
This adds Expressions.extract to extract a value from a variant in Iceberg filters.
The new method, Expressions.extract(column, path, type), accepts a column name, a JSON path, and a type. UnboundExtract is responsible for binding to BoundExtract. Binding the extract term validates that the referenced field is a variant, that the path is valid and supported, and that the type is valid. Binding is tested in TestExpressionBinding.
The new extract expression required extending BoundTerm with a new method, producesNull, to detect when isNull or notNull are determined by the expression. In addition, this PR adds support to handle unknown in binding.
The supported JSON path expressions are very limited. All paths must start with the root ($) and consist of only simple property selection using .name. Using JSON path allows later extension to use quoted names in brackets for field access (and more selection features), but avoids needing to add more complex cases now.
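To make the path restrictions concrete, here is a rough sketch of a parser that accepts only the root followed by simple .name selectors. It is not the code from this PR: the class name SimplePathSketch is hypothetical, it uses only the JDK rather than Guava's Splitter, and the choices to accept a bare $ and to limit names to letters, digits, and underscores are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SimplePathSketch {
  private static final String ROOT = "$";

  // Returns the field names selected by a path such as "$.event.user".
  public static List<String> parsePath(String path) {
    if (path == null || !path.startsWith(ROOT)) {
      throw new IllegalArgumentException("Path must start with " + ROOT + ": " + path);
    }

    List<String> names = new ArrayList<>();
    String rest = path.substring(ROOT.length());
    if (rest.isEmpty()) {
      // Assumption: a bare root selects the whole variant value.
      return names;
    }

    if (rest.charAt(0) != '.') {
      // Bracket selectors such as $['name'] are rejected in this sketch.
      throw new IllegalArgumentException("Unsupported path, expected .name: " + path);
    }

    // The -1 limit keeps trailing empty segments so "$.a." is rejected below.
    for (String name : rest.substring(1).split("\\.", -1)) {
      boolean simple =
          !name.isEmpty() && name.chars().allMatch(c -> Character.isLetterOrDigit(c) || c == '_');
      if (!simple) {
        throw new IllegalArgumentException("Unsupported path segment in: " + path);
      }
      names.add(name);
    }

    return names;
  }
}
```

With this sketch, "$.event.user" parses to the names event and user, while paths that omit the root, use brackets, or contain empty segments are rejected with an IllegalArgumentException.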