-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for alternative argument type support in the protobuf representation (type and option arguments) #161
Add support for alternative argument type support in the protobuf representation (type and option arguments) #161
Conversation
It feels a bit odd to have a |
Counterpoint: message ScalarFunction {
uint32 function_reference = 1;
repeated FunctionArgument args = 2;
Type output_type = 3;
}
message FunctionArgument {
oneof arg_type {
Enum enum = 1;
Type type = 2;
Expression value = 3;
}
}
message Expression {
oneof rex_type {
Literal literal = 1;
FieldReference selection = 2;
ScalarFunction scalar_function = 3;
WindowFunction window_function = 5;
IfThen if_then = 6;
SwitchExpression switch_expression = 7;
SingularOrList singular_or_list = 8;
MultiOrList multi_or_list = 9;
Cast cast = 11;
Subquery subquery = 12;
}
} but that'd be a breaking change. |
Shoot, not sure how I got this wrong since it is so clearly spelled out on the function definition side. I think the breaking change is the right one. If the args wasn't a repeated, we could do a clean breaking change using oneof at the top level. Unfortunately, oneof isn't supported for repeated types. I think the right solution is to change to do something similar to:
Note that the change above would is a name-based-json-from-protobuf breaking change. That's because one way protobuf to json works is to use field names as opposed to field ordinals. I think that is okay. I don't want there to be an expectation that protobuf's json format is a stable api, only protobuf binary itself. If people want json stability, they should use the field ordinal json representation (instead of the name based one). @jvanstraten, want to propose a patch? Also note that we also need to come up with something to represent the arbitrary option string values (the third type of function arguments) |
This response got very long as I typed it out, so tl;dr:
But, if you disagree with postponing breaking changes until after the first release, I'll change the PR to use a In case my reasoning isn't clear, the long versions follow. My understanding of functions right now...... based on the various bits of documentation, protobuf messages, comments, slack, what Isthmus outputs, etc is:
where:
Please shoot holes in this if I got anything wrong, since getting it exactly right is pretty important for the validator. But, unwritten intentions aside, I'm pretty sure this interpretation works for possible function definitions and invocations, except for functions that require type arguments. Hence the proposed addition, and me moving I personally think making a breaking change for this right now is a bad idea, because...
Basically, I'd say it's better to bundle breaking changes up via releases than to trickle them into the specification before we have versioning set up. Non-rhetorical answer to "what'd be the type of Expression.enum or my proposed Expression.type_literal variant outside the context of function arguments?"I'd say it simply is undefined, because it's illegal to do it. The best example I could come up with for something that looks valid but isn't would be As for what I mean with "undefined" and "illegal" in practice; well, I'd say a consumer is free to reject a plan which does something that's illegal, or interpret it however it wants, or invoke |
I don't think introducing an additional FunctionArgument is a breaking change so I think I'm a little confused by your comment. (Replacing the existing repeated field would be breaking.) What are you saying is breaking exactly? I'm not inclined to add type to the expression tree as I think that adds more complexity to semantic validation of expressions (already one of the more complex patterns). I think it was a mistake to add enum reflecting back on it. An enum can only go in a function argument position. All other things in expression fit in many places. That was a mistake on my part. I'd be inclined to mark that deprecated as well and keep clean in function argument pattern. Thoughts? |
Looking at your alternative again, I misinterpreted what "deprecated" means for protobuf and assumed your solution would be a completely breaking change. I guess a producer that wants to be compatible with consumers that haven't been upgraded yet could populate both the I came up with another non-breaking alternative: message ScalarFunction {
uint32 function_reference = 1;
repeated FunctionArgument args = 2;
Type output_type = 3;
}
message FunctionArgument {
oneof arg_type {
Literal literal = 1;
FieldReference selection = 2;
ScalarFunction scalar_function = 3;
WindowFunction window_function = 5;
IfThen if_then = 6;
SwitchExpression switch_expression = 7;
SingularOrList singular_or_list = 8;
MultiOrList multi_or_list = 9;
Enum enum = 10;
Cast cast = 11;
Subquery subquery = 12;
Type type = 13;
}
}
message Expression {
oneof rex_type {
Literal literal = 1;
FieldReference selection = 2;
ScalarFunction scalar_function = 3;
WindowFunction window_function = 5;
IfThen if_then = 6;
SwitchExpression switch_expression = 7;
SingularOrList singular_or_list = 8;
MultiOrList multi_or_list = 9;
Cast cast = 11;
Subquery subquery = 12;
reserved 10, 13;
reserved "enum", "type";
}
} but it's obviously not without its own set of downsides. So with that I guess we have the following possible solutions:
My preference would depend on where the project is going in terms of release model. If a breaking overhaul to properly fix all the things that aren't nice in the proto files right now is on the table for after the initial release, I still prefer option 1 (or maybe 4) for now, and then option 2 batched up with other breaking changes later. If not, I have a weak preference for 4 over 3, because you don't need to resort to any not-so-obvious logic to maintain backward compatibility (the repetition is annoying, but relatively hard to implement incorrectly). I personally much prefer the occasional batch of breaking changes that truly improve the format over incremental changes with deprecation, especially for young projects, because in my experience the latter makes for increasingly messy code and a confusing specification as deprecations add up. On the other hand, with the occasional overhaul, the resulting code can be much cleaner. The toplevel plan message could then be replaced with something like message VersionedPlan {
oneof version {
// Latest version: unstable, just whatever is on the main branch right now.
Plan latest = 1,
// substrait.vX_Y = copy of substrait namespace made during the
// version X.Y.0 release (excluding other version namespaces, and with
// its own copy of VersionedPlan that replaces the other version options
// with Empty message types, so implementations that only want to support
// one version can use that copy of VersionedPlan and don't have to compile
// generated code for all the other versions). Only receives non-breaking
// bugfix updates, or maybe backported additions, released as version
// X.Y.n++.
v0_1.Plan v0_1 = 2,
v0_2.Plan v0_2 = 3,
// ...
}
} Producers just emit whatever version they support, consumers can choose which version or versions they want to support and can throw a clear error message when they receive a different one. We could also maintain a library along with the validator that can convert between versions, to centralize that code. |
Some thoughts:
|
That all sounds fair, and I don't have anything more to add. I don't have a strong preference (besides backward compatibility, but I also know my general opinion on that is pretty extreme), so unless someone else pitches in I'll defer to your judgment. You/someone else might want to update/replace this patch though since I'm mostly AFK for the coming two weeks. |
@cpcloud, did you have any thoughts on the options/comments above? |
c32b3e9
to
e4d3c74
Compare
Type arguments can now be passed to functions. This modifies the structure for specifying function arguments in general, deprecating the old structure. Note that this is a breaking change w.r.t. the JSON serialization of functions; only the binary serialization is stable.
e4d3c74
to
ab91ee2
Compare
Implemented option 3, updated commit message to comply with conventional commits, and rebased. However, CI (IMO correctly) rejects the commit without a breaking change marker because the JSON format has changed. @jacques-n, do you consider this to be overly restrictive CI, or do you think I should add a breaking change marker, or not change |
Any breaking change will cause a major version bump, FYI. The CI uses |
Which is very nice, by the way; I wasn't expecting it to be that thorough :) |
Sigh. I definitely don't want to think about/worry about JSON format stuff. As such, I'd prefer to not have that check. However, given the reliance people have on the
|
a1f9944
to
0dade96
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks finished. Thanks @jvanstraten !
Oops, I think fixing the field names fell off my TODO list. Sorry about that. Also... it looks like something went wrong with the merge/squash commit message, as it wasn't picked up by conventional commits (it's not in the release notes). This is what it ended up as:
I'll make a PR to retcon it into the release notes, but I guess this is something to look out for when merging? |
I think the problem was the top level commit message wasn't feat: @cpcloud , I was under the impression that squash and merge combo merge messages would all be capture in the release notes. Is that not true? For example, there is a feat in this commit. Despite that being an inner commit message, would that have shown up in the next release notes? I thought it would but now I'm not sure (given the behavior of the commit here). |
Catching up with this. From our perspective (starting to look at the project, but no significant chunks of code depending on it yet) option 2 (hard break) is cleaner as it makes the long term tidier, but I can understand folks opting for 3 to avoid current user labour. <2cents> |
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of. Stuff that broke: - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217 Authored-by: Jeroen van Straten <[email protected]> Signed-off-by: Weston Pace <[email protected]>
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of. Stuff that broke: - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217 Authored-by: Jeroen van Straten <[email protected]> Signed-off-by: Weston Pace <[email protected]>
* feat: simplify InPredicate builder * refactor: add TestBase class with common utils * fix: handle custom extensions in Rels reached through Expressions
Unless I'm misunderstanding something, it seems like it's currently impossible to call functions that take type arguments, such as the one from
functions_cast.yaml
:since there is no way to pass type literals (copypasting the relevant bits of
algebra.proto
):so this PR adds an expression type to accommodate.