Add support for alternative argument type support in the protobuf representation (type and option arguments) #161

Merged
merged 2 commits into from
May 20, 2022


jvanstraten
Contributor

Unless I'm misunderstanding something, it seems like it's currently impossible to call functions that take type arguments, such as the one from functions_cast.yaml:

name: cast
description: Cast one type to another.
impls:
  - args:
      - value: any1
      - type: output
    return: output

since there is no way to pass type literals (copypasting the relevant bits of algebra.proto):

message ScalarFunction {
  // points to a function_anchor defined in this plan
  uint32 function_reference = 1;
  repeated Expression args = 2;
  Type output_type = 3;
}

message Expression {
  oneof rex_type {
    Literal literal = 1;
    FieldReference selection = 2;
    ScalarFunction scalar_function = 3;
    WindowFunction window_function = 5;
    IfThen if_then = 6;
    SwitchExpression switch_expression = 7;
    SingularOrList singular_or_list = 8;
    MultiOrList multi_or_list = 9;
    Enum enum = 10;
    Cast cast = 11;
    Subquery subquery = 12;
  }
}

so this PR adds an expression type to accommodate.

@cpcloud
Contributor

cpcloud commented Apr 6, 2022

It feels a bit odd to have a Type in the expression oneof. What would evaluating a value of type Type return?

@jvanstraten
Contributor Author

Counterpoint: enum is already only valid when used immediately as a function argument (or at least, I can't think of any other use for it). Having a separate message type for function arguments would feel better to me too, like

message ScalarFunction {
  uint32 function_reference = 1;
  repeated FunctionArgument args = 2;
  Type output_type = 3;
}

message FunctionArgument {
  oneof arg_type {
    Enum enum = 1;
    Type type = 2;
    Expression value = 3;
  }
}

message Expression {
  oneof rex_type {
    Literal literal = 1;
    FieldReference selection = 2;
    ScalarFunction scalar_function = 3;
    WindowFunction window_function = 5;
    IfThen if_then = 6;
    SwitchExpression switch_expression = 7;
    SingularOrList singular_or_list = 8;
    MultiOrList multi_or_list = 9;
    Cast cast = 11;
    Subquery subquery = 12;
  }
}

but that'd be a breaking change.

@jacques-n
Contributor

Shoot, not sure how I got this wrong since it is so clearly spelled out on the function definition side.

I think the breaking change is the right one. If args weren't a repeated field, we could do a clean breaking change using a oneof at the top level. Unfortunately, oneof doesn't support repeated fields. I think the right solution is to do something similar to:

message ScalarFunction {
  uint32 function_reference = 1;
  repeated Expression deprecated_args = 2 [deprecated = true];  // use FunctionArgument args below moving forward
  Type output_type = 3;
  repeated FunctionArgument args = 4;
}

Note that the change above is a breaking change for the name-based JSON representation of the protobuf. That's because one way of converting protobuf to JSON uses field names as opposed to field ordinals. I think that is okay. I don't want there to be an expectation that protobuf's JSON format is a stable API, only the protobuf binary format itself. If people want JSON stability, they should use the field-ordinal JSON representation (instead of the name-based one).

@jvanstraten, want to propose a patch?

Also note that we need to come up with something to represent the arbitrary option string values (the third type of function argument).

@jacques-n jacques-n changed the title from "Add type literal expression" to "Add support for alternative argument type support in the protobuf representation (type and option arguments)" Apr 20, 2022
@jvanstraten
Contributor Author

This response got very long as I typed it out, so tl;dr:

  • AFAICT specifying enum arguments is already possible (I don't see what else they could be for, there is no enum type in the type system?); and
  • I absolutely agree that using a FunctionArgument message is an objectively better way to describe the intended data structure than what's in the PR right now, but I don't think making breaking changes is a good idea until after there's been an initial release (because Substrait is already being used by a bunch of projects); when
  • it's also possible to patch the hole in a backward-compatible way for now (as proposed in this PR).

But, if you disagree with postponing breaking changes until after the first release, I'll change the PR to use a FunctionArgument message with that deprecation logic.

In case my reasoning isn't clear, the long versions follow.

My understanding of functions right now...

... based on the various bits of documentation, protobuf messages, comments, Slack, what Isthmus outputs, etc., is:

  • Functions can define any number and combination of required and optional enum parameters, followed by any number and combination of value and type parameters (but enums must come before values/types).
  • The last value parameter can be specified to be variadic, allowing it to match one or more arguments (either with all exactly the same type, or matching its parameter type pattern independently for each instance).
    • I don't think anything states that you can't make type or enum parameters variadic, nor do I think that it breaks anything. It makes very little sense to do it, though. I'd probably emit a warning for it in the validator (haven't implemented function call checking yet).
  • The name of the function used in ExtensionFunction consists of the name of the function definition, followed by a :, followed by an _-separated list of simplified parameter type names, as defined in the table here (note that Isthmus does not implement this table correctly, it uses any with the number still attached and a couple of the types are not abbreviated correctly, like decimal). For the avoidance of doubt, a variadic function like max(T...) would be referred to as max:any and use the same function anchor for both an invocation like max(fp32, fp32) and max(i32, i32, i32, i32), as opposed to using max:fp32_fp32 and max:i32_i32_i32_i32 respectively. Exception: if only a single overload exists for a particular function name, it may also be referred to as just its name, so in this case just max. Matching is case-insensitive, so MAX or MaX:aNy are also permissible (IMO case insensitivity is a terrible idea for anything that isn't human-written and dubious for things that are, but everything else in the spec is explicitly case-insensitive, so I'm assuming this is, too). Thus, this expanded name can be generated from the function definition, and can thus be case-insensitively mapped to its implementation by a consumer; in case of a single overload this map will contain two entries mapping to the same implementation, in all other cases the name is case-insensitively unique.
  • In a function call, the argument list must consist of zero or more Expression.enums, followed by zero or more non-enum expression arguments.
    • The Enums are positionally matched; enough Enums must be specified to supply a value for all required enum parameters, but optional enum parameters need not be specified (strictly speaking I would argue they should be specified, but requiring that would probably break every implementation out there). Optional enum parameters that come before required enum parameters can be skipped using the Enum.unspecified oneof variant. The values passed to each Enum parameter must case-insensitively match one of the variants defined in the function prototype. Enum.unspecified may only be passed to optional enum parameters; doing so is equivalent to passing the first option. Not positionally specifying optional enum parameters at the end of the enum parameter list is also equivalent to passing the first option.
    • The first non-Expression.enum function argument marks the start of the value/type argument list. They are positionally bound to the parameters. All value/type parameters are mandatory, the last of which may match one or more arguments if the function is variadic. Type arguments are specified using [missing], value arguments are specified using any other type of expression.
  • Expression.enum and [missing] are only legal in the context of function arguments.

where:

  • [missing] is the oneof variant I proposed to add;
  • I use "parameter" to refer to the "arguments" in the function prototype and "argument" to refer to the things bound to the parameters in a call.
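The naming rule described above (definition name, a `:`, then an `_`-separated list of simplified parameter types, matched case-insensitively, with the bare name allowed when only one overload exists) could be sketched roughly as follows. This is only an illustration of that reading: `compound_name`, `build_lookup`, and `resolve` are hypothetical helpers, and the only type simplification modelled is stripping the number from `any1`; the spec's full abbreviation table is not reproduced here.

```python
import re


def compound_name(name, param_types):
    """Build 'name:type_type' from a function definition's parameter types."""
    # Assumption: only 'anyN' -> 'any' is simplified in this sketch; the real
    # abbreviation table from the specification is not reproduced.
    short = [re.sub(r"^any\d+$", "any", t.lower()) for t in param_types]
    return (name + ":" + "_".join(short)).lower()


def build_lookup(overloads):
    """Map lower-cased reference names to implementations.

    overloads is a list of (name, param_types, implementation) tuples.
    A variadic parameter appears once in param_types.
    """
    lookup = {}
    by_name = {}
    for name, param_types, impl in overloads:
        lookup[compound_name(name, param_types)] = impl
        by_name.setdefault(name.lower(), []).append(impl)
    # A function with exactly one overload may also be referenced by name only.
    for name, impls in by_name.items():
        if len(impls) == 1:
            lookup[name] = impls[0]
    return lookup


def resolve(lookup, reference):
    """Case-insensitively resolve a function reference such as 'MaX:aNy'."""
    return lookup[reference.lower()]


lookup = build_lookup([
    ("max", ["any1"], "max_impl"),          # variadic max(T...)
    ("add", ["i32", "i32"], "add_i32"),
    ("add", ["fp32", "fp32"], "add_fp32"),
])
```

With this table, `max`, `MAX`, and `MaX:aNy` all resolve to the same implementation, while `add` on its own is not a valid reference because two overloads exist.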

Please shoot holes in this if I got anything wrong, since getting it exactly right is pretty important for the validator. But, unwritten intentions aside, I'm pretty sure this interpretation works for possible function definitions and invocations, except for functions that require type arguments. Hence the proposed addition, and me moving Expression.enum to FunctionArgument.enum in my previous comment.
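For the enum part of that interpretation, a validator check might look something like the sketch below. It is an assumption-laden illustration, not validator code: `EnumParam` and `bind_enum_args` are hypothetical names, `None` stands in for the `Enum.unspecified` variant, and the option names are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EnumParam:
    options: List[str]   # variants listed in the function prototype
    required: bool


def bind_enum_args(params, args):
    """Positionally bind enum arguments to enum parameters.

    args is a list of strings; None models Enum.unspecified. Returns the
    canonical option chosen for every parameter, or raises ValueError.
    """
    if len(args) > len(params):
        raise ValueError("too many enum arguments")
    bound = []
    for index, param in enumerate(params):
        # Trailing parameters with no argument are treated like unspecified.
        arg = args[index] if index < len(args) else None
        if arg is None:
            if param.required:
                raise ValueError(f"enum parameter {index} is required")
            bound.append(param.options[0])  # unspecified == first option
            continue
        matches = [o for o in param.options if o.lower() == arg.lower()]
        if not matches:
            raise ValueError(f"{arg!r} is not a valid option for parameter {index}")
        bound.append(matches[0])
    return bound
```

For instance, with an optional parameter followed by a required one, `bind_enum_args(params, [None, "b"])` skips the optional parameter (binding its first option) and matches `"b"` case-insensitively against the required one.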

I personally think making a breaking change for this right now is a bad idea, because...
  • people are already using Substrait and IMO treating it as a "v1" (for better or for worse);
  • no one is using type arguments yet (since it's impossible to do so), so apparently it's not considered to be an important feature (I, too, only pointed it out because I encountered it while making the validator, I don't see any high-priority use cases for it);
  • it's possible to patch this in a way that's fully backward-compatible, and aligns with the (equally ugly) protobuf implementation of enum arguments (assuming my understanding of that is correct);
  • I understand from @cpcloud that we'll be doing a release Soon™, so I'd say it's better to just start with a release that's backward compatible with what people have been using with as many holes fixed as possible (I don't think I encountered any holes severe enough to not address them before a release while writing the validator, other than this one, fix: add missing switch expression #160, and inaccuracies in the spec itself that I made various PRs for);
  • if we're going to do breaking changes for things that technically work but are just not very nice, there's lots more things I'd want to propose changes for; I've been holding back because of the above and to not flood the repo with issues, but I encountered lots of things like that while making the validator, such as what I described in the last paragraph of this comment. You can grep for FIXME in my validator branch for a more exhaustive list.

Basically, I'd say it's better to bundle breaking changes up via releases than to trickle them into the specification before we have versioning set up.

Non-rhetorical answer to "what'd be the type of Expression.enum or my proposed Expression.type_literal variant outside the context of function arguments?"

I'd say it simply is undefined, because it's illegal to do it. The best example I could come up with for something that looks valid but isn't would be add:i32_i32(if column0 then SILENT else ERROR, column1, column2), but in 99.9% of the cases it's pretty obviously invalid. You could just as well ask what the type of if true then "answer" else 42 is; that also looks valid, and you could argue that it's a string because it would be after naively constant-propagating it. But then what about if column0 then "answer" else 42? Maybe 42 coerces to a string? Then what about if column0 then 42 else "42", is that just always the integer 42 or would it be "42"? In all these cases my answer would again be that it is undefined, because it's illegal.

As for what I mean by "undefined" and "illegal" in practice: well, I'd say a consumer is free to reject a plan which does something that's illegal, or interpret it however it wants, or invoke HALT_AND_CATCH_FIRE for all Substrait is concerned. It should probably do the former, but that's not up to Substrait to decide; the plan ceases to be a (valid) Substrait plan when it does something illegal. Maybe someone else defines "Superstrait" to "unofficially" extend Substrait while also maintaining compatibility; it'd be pretty easy to do so with protobuf, after all. Later versions of Substrait itself might also extend the spec to make things that were previously illegal legal. Finally, a consumer may not need all information that Substrait requires to be in the plan (like the schema in a read relation for example), and it'd probably be counterproductive to user experience for an engine to reject an incoming plan on a technicality. So, for all of those reasons I'd say it's a bad idea to require consumers to reject invalid plans. It'd also be pretty difficult for them to actually do so, since Substrait is pretty context-sensitive due to the YAML extensions, and a consumer may not have access to all YAML files presented to it (nor would it otherwise need them).

@jacques-n
Contributor

I don't think introducing an additional FunctionArgument is a breaking change so I think I'm a little confused by your comment. (Replacing the existing repeated field would be breaking.) What are you saying is breaking exactly?

I'm not inclined to add type to the expression tree, as I think that adds more complexity to semantic validation of expressions (already one of the more complex patterns). Reflecting back on it, I think it was a mistake on my part to add enum there: an enum can only go in a function argument position, whereas all other things in Expression fit in many places. I'd be inclined to mark that deprecated as well and keep things clean with the function argument pattern. Thoughts?

@jvanstraten
Contributor Author

Looking at your alternative again, I misinterpreted what "deprecated" means for protobuf and assumed your solution would be a completely breaking change. I guess a producer that wants to be compatible with consumers that haven't been upgraded yet could populate both the Expression list and the FunctionArgument list (assuming the plan doesn't use any function with type arguments), and consumers that want to be compatible with older producers can use the FunctionArgument list if it's populated, and else fall back on the Expression list. As a user of Substrait I'd still much prefer a change that works out of the box, though...
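To make the dual-population idea concrete, here is a rough sketch of the producer and consumer logic it implies. The dataclasses are hypothetical stand-ins for the generated protobuf classes, the field names follow the `deprecated_args` / `args` proposal quoted earlier in this thread, and the producer is simplified to fill the legacy list only when every argument is a plain value (a real producer could also wrap enum arguments in the legacy `Expression.enum`).

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class FunctionArgument:
    kind: str      # "enum", "type", or "value"
    payload: Any


@dataclass
class ScalarFunction:
    function_reference: int
    deprecated_args: List[Any] = field(default_factory=list)        # old Expression list
    args: List[FunctionArgument] = field(default_factory=list)      # new list


def emit_compatible(function_reference, arguments):
    """Producer side: always fill the new list, and mirror it into the
    legacy list when that is possible (simplification: values only)."""
    fn = ScalarFunction(function_reference, args=list(arguments))
    if all(a.kind == "value" for a in arguments):
        fn.deprecated_args = [a.payload for a in arguments]
    return fn


def effective_arguments(fn):
    """Consumer side: prefer the new list, else fall back on the old one."""
    if fn.args:
        return list(fn.args)
    # Simplification: every legacy expression is treated as a value argument.
    return [FunctionArgument("value", e) for e in fn.deprecated_args]
```

An upgraded consumer calling `effective_arguments` sees the same argument list whether the plan came from an old producer (legacy list only) or a new one (both lists, or the new list only when type arguments are involved).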

I came up with another non-breaking alternative:

message ScalarFunction {
  uint32 function_reference = 1;
  repeated FunctionArgument args = 2;
  Type output_type = 3;
}

message FunctionArgument {
  oneof arg_type {
    Literal literal = 1;
    FieldReference selection = 2;
    ScalarFunction scalar_function = 3;
    WindowFunction window_function = 5;
    IfThen if_then = 6;
    SwitchExpression switch_expression = 7;
    SingularOrList singular_or_list = 8;
    MultiOrList multi_or_list = 9;
    Enum enum = 10;
    Cast cast = 11;
    Subquery subquery = 12;
    Type type = 13;
  }
}

message Expression {
  oneof rex_type {
    Literal literal = 1;
    FieldReference selection = 2;
    ScalarFunction scalar_function = 3;
    WindowFunction window_function = 5;
    IfThen if_then = 6;
    SwitchExpression switch_expression = 7;
    SingularOrList singular_or_list = 8;
    MultiOrList multi_or_list = 9;
    Cast cast = 11;
    Subquery subquery = 12;
  }
  // reserved statements are not allowed inside a oneof, so they go at message level
  reserved 10, 13;
  reserved "enum", "type";
}

but it's obviously not without its own set of downsides. So with that I guess we have the following possible solutions:

  1. add type literal to Expression -> minimal change, but resulting data structure is weird;
  2. replace repeated Expression in ScalarFunction et al with a repeated FunctionArgument that includes Expression as an option, deprecate or reserve Expression.enum -> cleanest resulting data structure, but is a breaking change;
  3. as above but add the repeated FunctionArgument, deprecating the old one -> the same data structure but non-breaking, however requires messy producer/consumer logic to properly deal with both the new and old method for defining arguments;
  4. replace repeated Expression in ScalarFunction et al with a repeated FunctionArgument that is itself binary- and JSON-backward-compatible -> relatively clean data structure (in that enum and type literals can only appear in function argument context) but with some duplication in the proto files and upgraded consumers/producers.

My preference would depend on where the project is going in terms of release model. If a breaking overhaul to properly fix all the things that aren't nice in the proto files right now is on the table for after the initial release, I still prefer option 1 (or maybe 4) for now, and then option 2 batched up with other breaking changes later. If not, I have a weak preference for 4 over 3, because you don't need to resort to any not-so-obvious logic to maintain backward compatibility (the repetition is annoying, but relatively hard to implement incorrectly).

I personally much prefer the occasional batch of breaking changes that truly improve the format over incremental changes with deprecation, especially for young projects, because in my experience the latter makes for increasingly messy code and a confusing specification as deprecations add up. On the other hand, with the occasional overhaul, the resulting code can be much cleaner. The toplevel plan message could then be replaced with something like

message VersionedPlan {
  oneof version {
    // Latest version: unstable, just whatever is on the main branch right now.
    Plan latest = 1;

    // substrait.vX_Y = copy of substrait namespace made during the
    // version X.Y.0 release (excluding other version namespaces, and with
    // its own copy of VersionedPlan that replaces the other version options
    // with Empty message types, so implementations that only want to support
    // one version can use that copy of VersionedPlan and don't have to compile
    // generated code for all the other versions). Only receives non-breaking
    // bugfix updates, or maybe backported additions, released as version
    // X.Y.n++.
    v0_1.Plan v0_1 = 2;
    v0_2.Plan v0_2 = 3;
    // ...
  }
}

Producers just emit whatever version they support, consumers can choose which version or versions they want to support and can throw a clear error message when they receive a different one. We could also maintain a library along with the validator that can convert between versions, to centralize that code.
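On the consumer side, that version dispatch might look roughly like the following. This is a sketch under stated assumptions: the plan is modelled as a one-entry dict that mimics the `version` oneof, and `consume`, `UnsupportedPlanVersion`, and the handler table are hypothetical names rather than anything that exists in Substrait.

```python
class UnsupportedPlanVersion(Exception):
    """Raised when a plan uses a version this consumer does not handle."""


def consume(versioned_plan, handlers):
    """Dispatch a versioned plan to the handler for its version.

    versioned_plan mimics a oneof: a dict with exactly one key naming the
    version variant that is set. handlers maps version names to callables.
    """
    if len(versioned_plan) != 1:
        raise ValueError("exactly one version variant must be set")
    (version, plan), = versioned_plan.items()
    handler = handlers.get(version)
    if handler is None:
        supported = ", ".join(sorted(handlers)) or "none"
        raise UnsupportedPlanVersion(
            f"plan uses version {version!r}; this consumer supports: {supported}"
        )
    return handler(plan)
```

A consumer that only registers a `v0_1` handler would then fail with a message naming both the received and the supported versions when handed a `v0_2` plan, instead of misinterpreting it.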

@jacques-n
Contributor

jacques-n commented Apr 24, 2022

Some thoughts:

  • I'm strongly against the first option you described (adding type to the expression union). This moves a bunch of validation of expression trees from a protobuf pattern to a semantic analysis problem.
  • Protobuf is supported as a binary format only. We should not be focused on trying to maintain compatibility of the JSON variation of that.
  • We haven't made any releases yet. If we've made some mistakes (clearly, we have), now is the perfect time to clean them up (at least the ones we recognize now). Option 3 allows existing users to handle things in the near-term and gets us to a clean state for the long term. Frankly, I'm even comfortable with a hard breaking change at this point (your option 2).
  • Number 4 creates redundancy that requires constant maintenance in the future for what I see as little benefit. (Again, let's not get ahead of ourselves on api stability.)

@jvanstraten
Contributor Author

That all sounds fair, and I don't have anything more to add. I don't have a strong preference (besides backward compatibility, but I also know my general opinion on that is pretty extreme), so unless someone else pitches in I'll defer to your judgment. You/someone else might want to update/replace this patch though since I'm mostly AFK for the coming two weeks.

@jacques-n
Contributor

@cpcloud, did you have any thoughts on the options/comments above?

@jvanstraten jvanstraten force-pushed the add-type-literal-expr branch 3 times, most recently from c32b3e9 to e4d3c74 Compare May 16, 2022 12:07
Type arguments can now be passed to functions. This modifies the
structure for specifying function arguments in general, deprecating
the old structure. Note that this is a breaking change w.r.t. the
JSON serialization of functions; only the binary serialization is
stable.
@jvanstraten
Contributor Author

Implemented option 3, updated commit message to comply with conventional commits, and rebased. However, CI (IMO correctly) rejects the commit without a breaking change marker because the JSON format has changed. @jacques-n, do you consider this to be overly restrictive CI, or do you think I should add a breaking change marker, or not change args to deprecated_args and use args2 or arguments or something for the new structure? I prefer the latter (also saves me from having to rewrite dozens of test cases in the validator).

@cpcloud
Contributor

cpcloud commented May 16, 2022

Any breaking change will cause a major version bump, FYI.

The CI uses buf to check for breaking changes and, if there is one, enforces that the commit has the right footer.

@jvanstraten
Contributor Author

Which is very nice, by the way; I wasn't expecting it to be that thorough :)

@jacques-n
Contributor

Any breaking change will cause a major version bump, FYI.

The CI uses buf to check for breaking changes and, if there is one, enforces that the commit has the right footer.

because the JSON format has changed. @jacques-n, do you consider this to be overly restrictive CI

Sigh. I definitely don't want to think about/worry about JSON format stuff. As such, I'd prefer to not have that check. However, given the reliance people have on the protoc code generators which all use field names to compose their methods, I guess my concern/comment is a distinction without a difference.

arguments sounds like a winner for the new field.

Contributor

@jacques-n jacques-n left a comment


This looks finished. Thanks @jvanstraten !

@jacques-n jacques-n merged commit df98816 into substrait-io:main May 20, 2022
@jvanstraten jvanstraten deleted the add-type-literal-expr branch May 23, 2022 06:24
@jvanstraten
Contributor Author

Oops, I think fixing the field names fell off my TODO list. Sorry about that.

Also... it looks like something went wrong with the merge/squash commit message, as it wasn't picked up by conventional commits (it's not in the release notes). This is what it ended up as:

Add support for alternative argument type support in the protobuf representation (type and option arguments) (#161)

* feat: support type function arguments in protobuf

Type arguments can now be passed to functions. This modifies the
structure for specifying function arguments in general, deprecating
the old structure.

I'll make a PR to retcon it into the release notes, but I guess this is something to look out for when merging?

@jacques-n
Contributor

I think the problem was the top level commit message wasn't feat:
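That explanation can be illustrated with a small sketch of how a tool that inspects only the first line of a commit message would classify the squash commit. The regular expression follows the general `type(scope)!: description` header shape from the Conventional Commits convention; `header_type` is a hypothetical helper, and it is an assumption that the release tooling used here behaves this way.

```python
import re

# Header shape: type, optional (scope), optional !, then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)


def header_type(message):
    """Return the commit type from the first line, or None if it has none."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER.match(first_line)
    return match.group("type") if match else None


squash_message = (
    "Add support for alternative argument type support in the protobuf "
    "representation (type and option arguments) (#161)\n"
    "\n"
    "* feat: support type function arguments in protobuf\n"
)
```

Here `header_type(squash_message)` finds no type because the first line is the PR title, whereas `header_type("feat: support type function arguments in protobuf")` yields `feat`; the inner `* feat:` line is never examined.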

@cpcloud, I was under the impression that with a squash-and-merge, all of the combined commit messages would be captured in the release notes. Is that not true? For example, there is a feat in this commit. Despite that being an inner commit message, would that have shown up in the next release notes? I thought it would, but now I'm not sure (given the behavior of the commit here).

@curino
Contributor

curino commented May 24, 2022

Catching up with this. From our perspective (starting to look at the project, but no significant chunks of code depending on it yet) option 2 (hard break) is cleaner as it makes the long term tidier, but I can understand folks opting for 3 to avoid current user labour. <2cents>

westonpace pushed a commit to apache/arrow that referenced this pull request Jul 5, 2022
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.

Stuff that broke:
 - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136
 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169
 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161
 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217

Authored-by: Jeroen van Straten <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
drin pushed a commit to drin/arrow that referenced this pull request Jul 5, 2022
rkondakov pushed a commit to rkondakov/substrait that referenced this pull request Nov 21, 2023
* feat: simplify InPredicate builder
* refactor: add TestBase class with common utils
* fix: handle custom extensions in Rels reached through Expressions