rdblue commented Feb 17, 2025

This adds Expressions.extract to extract a value from a variant in Iceberg filters.

The new method, Expressions.extract(column, path, type), accepts a column name, a JSON path, and a type. UnboundExtract is responsible for binding to BoundExtract. Binding the extract term validates that the referenced field is a variant, that the path is valid and supported, and that the type is valid. Binding is tested in TestExpressionBinding.

The new extract expression required extending BoundTerm with a new method, producesNull, to detect when isNull or notNull are determined by the expression. In addition, this PR adds support to handle unknown in binding.

The supported JSON path expressions are very limited. All paths must start with the root ($) and consist only of simple property selection using .name. Using JSON path allows later extension to quoted names in brackets for field access (and more selection features), but avoids the need to add more complex cases now.
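
To make the restricted grammar concrete, here is a minimal sketch of a parser that accepts only `$` followed by `.name` segments. The class name `SimplePathSketch` and its `parse` method are hypothetical and written only for illustration; this is not the PathUtil code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the restricted path grammar; not Iceberg's PathUtil. */
final class SimplePathSketch {
  private SimplePathSketch() {}

  /** Splits a path such as "$.a.b" into its field names; "$" alone yields an empty list. */
  static List<String> parse(String path) {
    if (path == null || !path.startsWith("$")) {
      throw new IllegalArgumentException("Path must start with $: " + path);
    }

    List<String> names = new ArrayList<>();
    String rest = path.substring(1);
    if (rest.isEmpty()) {
      return names; // the root itself
    }

    if (rest.charAt(0) != '.') {
      throw new IllegalArgumentException("Expected .name after $: " + path);
    }

    // limit -1 keeps trailing empty segments so that "$.a." is rejected below
    for (String name : rest.substring(1).split("\\.", -1)) {
      if (name.isEmpty()) {
        throw new IllegalArgumentException("Empty field name in path: " + path);
      }
      // bracket and wildcard selectors are outside the supported subset
      if (name.indexOf('[') >= 0 || name.indexOf(']') >= 0 || name.indexOf('*') >= 0) {
        throw new IllegalArgumentException("Unsupported selector in path: " + path);
      }
      names.add(name);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(parse("$.event.user.id"));
  }
}
```

Rejecting everything outside the subset up front keeps the door open for bracket syntax later, because no currently valid path would change meaning.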

rdblue force-pushed the variant-add-extract-expression branch from 820b345 to d4b6e78 on February 18, 2025 21:19
aihuaxu left a comment

Minor comments. LGTM.

  switch (op()) {
    case IS_NULL:
-     if (boundTerm.ref().field().isRequired()) {
+     if (!boundTerm.producesNull()) {

Contributor

I see this change is the same as the one in BoundReference (field.isOptional()). For BoundTransform, it seems we added !transform.preservesOrder(). Which transforms produce nulls when they are not order-preserving?

// transforms must produce null for null input values
// transforms may produce null for non-null inputs when not order-preserving
return ref.producesNull() || !transform.preservesOrder();

Contributor Author

A good example is VoidTransform, which maps all values to null. If a transform preserves order then it can't produce a null value for a non-null input because that would violate order preservation. If it does not preserve order, then it could map values to null, so we account for that. If a field is required there are no null input values, so the only way a null could be produced is by the transform.
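
The rule described here can be sketched with a few stand-in types. Everything below (`ProducesNullSketch`, `Term`, `Ref`, `TransformTerm`) is hypothetical and only mirrors the `producesNull` logic quoted in this thread; it is not the Iceberg implementation.

```java
/** Hypothetical stand-ins that mirror the producesNull rule discussed above. */
final class ProducesNullSketch {
  private ProducesNullSketch() {}

  interface Term {
    boolean producesNull();
  }

  /** A column reference: it can only yield null when the field is optional. */
  static final class Ref implements Term {
    private final boolean optional;

    Ref(boolean optional) {
      this.optional = optional;
    }

    @Override
    public boolean producesNull() {
      return optional;
    }
  }

  /** A transform applied to a reference. */
  static final class TransformTerm implements Term {
    private final Term ref;
    private final boolean preservesOrder;

    TransformTerm(Term ref, boolean preservesOrder) {
      this.ref = ref;
      this.preservesOrder = preservesOrder;
    }

    @Override
    public boolean producesNull() {
      // null inputs always map to null; a transform that does not preserve order
      // (a void-style transform, for example) may also map non-null inputs to null
      return ref.producesNull() || !preservesOrder;
    }
  }

  public static void main(String[] args) {
    Term required = new Ref(false);
    System.out.println(new TransformTerm(required, true).producesNull());
    System.out.println(new TransformTerm(required, false).producesNull());
  }
}
```

With a required field, only the non-order-preserving case reports that nulls are possible, which is why an isNull check on it cannot be simplified away during binding.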

* <li><code>.name</code> accesses a field by name
* </ul>
*
* <p>If the query result is a list, the value is a variant array of results.

Contributor

Why do we only explain the array case here? The result can be an object, an array, or a primitive.

Contributor Author

Array is called out because some JSON path queries produce a list of results. In that case, it is necessary to wrap the results in a variant array to return them.

Contributor

I think there's potential for ambiguity here since the extract could produce a list of matched elements or a list of lists, which would be difficult to distinguish. I believe if we want to be compliant with RFC 9535 we would always return a list.

There are ways to work around this, for example Trino supports syntax for "conditional wrapper" or "unconditional wrapper". They also support separate access methods for query (multiple results) vs value (explicit single result).
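
A small sketch of the ambiguity, using plain Java lists as stand-ins for variant values. The class `WrapperSketch` and both method names are hypothetical, and `wrapIfMultiple` is a simplification of the "wrap only when the query yields a list" behavior under discussion, not Trino's exact wrapper semantics.

```java
import java.util.List;

/** Hypothetical illustration of why wrapping only multi-match results is ambiguous. */
final class WrapperSketch {
  private WrapperSketch() {}

  /** Returns the single match as-is and wraps only when there is not exactly one match. */
  static Object wrapIfMultiple(List<?> matches) {
    return matches.size() == 1 ? matches.get(0) : matches;
  }

  /** Always returns the list of matches, so the caller can tell the cases apart. */
  static List<?> alwaysWrap(List<?> matches) {
    return matches;
  }

  public static void main(String[] args) {
    List<?> oneArrayMatch = List.of(List.of(1, 2)); // one match whose value is an array
    List<?> twoScalarMatches = List.of(1, 2);       // two matches, each a scalar

    // both calls return a value equal to [1, 2], so the two cases are indistinguishable
    System.out.println(wrapIfMultiple(oneArrayMatch).equals(wrapIfMultiple(twoScalarMatches)));

    // always wrapping keeps [[1, 2]] distinct from [1, 2]
    System.out.println(alwaysWrap(oneArrayMatch).equals(alwaysWrap(twoScalarMatches)));
  }
}
```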

Contributor Author

I'll remove this for now and move the JSON path part so that it isn't exposed anywhere. That way we can change it all later.

Contributor Author

rdblue, Feb 21, 2025

Done. This is now removed and PathUtil is package-private.

private static final Splitter DOT = Splitter.on(".");
private static final String ROOT = "$";

public static List<String> parsePath(String path) {

Contributor

I don't like that we're using a List<String> to handle this as I feel it's going to get much more complicated over time. Why not put this in a wrapper of VariantPath? I think it would be better not to leak this through a public interface.

Contributor

In fact, I think this whole class should just be redefined as VariantPath since that's all it's really doing at this point and then we can build on this going forward.
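
As a rough sketch of that suggestion, a wrapper type could hide the list of names behind a small API so the representation can change later. The class `VariantPathSketch` and its methods are hypothetical, written only for illustration; no such class is shown in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical wrapper that keeps the path representation private. */
final class VariantPathSketch {
  // internal representation; callers never see the list, so it can change later
  private final List<String> names;

  private VariantPathSketch(List<String> names) {
    this.names = Collections.unmodifiableList(names);
  }

  /** The root path, written as "$". */
  static VariantPathSketch root() {
    return new VariantPathSketch(new ArrayList<>());
  }

  /** Returns a new path that selects the named field under this path. */
  VariantPathSketch field(String name) {
    if (name == null || name.isEmpty()) {
      throw new IllegalArgumentException("Field name must be non-empty");
    }
    List<String> copy = new ArrayList<>(names);
    copy.add(name);
    return new VariantPathSketch(copy);
  }

  int depth() {
    return names.size();
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("$");
    for (String name : names) {
      sb.append('.').append(name);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(VariantPathSketch.root().field("event").field("id"));
  }
}
```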

Contributor Author

I'll move this and make it package-private so we don't need to worry about changing it later.

Contributor Author

This is now package-private so we can modify it later, but my follow-up PR should still work.

rdblue force-pushed the variant-add-extract-expression branch from 11bc4b7 to a8a90aa on February 21, 2025 00:05
rdblue merged commit d4fe23a into apache:main on Feb 21, 2025
43 checks passed
rdblue commented Feb 21, 2025

Thanks for the reviews, @danielcweeks and @aihuaxu!
