Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support consuming Substrait with compound signature function names #10653

Merged

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented May 24, 2024

Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes. We don't necessarily need that information while consuming the plans, but we need to support those compound names.

Which issue does this PR close?

Closes #10412, but doesn't fix the part where DataFusion generates those outdated simple names. But that's a bit bigger fix - this small thing would be enough to unblock me if it'd be okay to merge this before fixing the producer side.

Rationale for this change

See issue.

What changes are included in this PR?

When consuming Substrait, checks if given function names are compound names, if so strips the arguments part.

Are these changes tested?

Tested manually, but will still need to add unit tests.

Are there any user-facing changes?

@Blizzara
Copy link
Contributor Author

@alamb curious what you think of fixing just the consumer side first, without touching the producer - if that'd be okay, then I can add some unit tests to this PR?

@alamb
Copy link
Contributor

alamb commented May 27, 2024

@alamb curious what you think of fixing just the consumer side first, without touching the producer - if that'd be okay, then I can add some unit tests to this PR?

I think it sounds good to me as long as there are sufficient test coverage.

Thank you for pushing this along @Blizzara

Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes.
We don't necessarily need that information while consuming the plans, but we need to support those compound names.
@Blizzara Blizzara force-pushed the avo/substrait-function-signature-compound-names branch from f7f69c5 to f39c63b Compare May 27, 2024 16:36
@Blizzara Blizzara marked this pull request as ready for review May 27, 2024 16:53
@Blizzara
Copy link
Contributor Author

@alamb Sounds good! here's an added test, though curious if you have any thoughts on how to make it simpler - given DF doesn't yet produce those compound names, I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone.. Using base64 encoded binary proto would be a bit easier but then there's some recent bad experiences with people uploading malicious stuff as "test binaries" so I didn't want to do that, but happy to if you'd find it better 😅

@alamb
Copy link
Contributor

alamb commented May 28, 2024

I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone.

I don't have any better ideas unfortunately, until DataFusion has roundtrip support

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Blizzara -- this looks great. I really like the idea of having checked in json files with substrait from other systems (even apart from this particular feature)

The only question I have is the need to add a new (non dev) dependency. Otherwise this looks great

cc @waynexia

@@ -37,11 +37,13 @@ chrono = { workspace = true }
datafusion = { workspace = true, default-features = true }
itertools = { workspace = true }
object_store = { workspace = true }
pbjson-types = "0.6"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be a actual dependency or can it be a dev dependency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered in more depth here, but yes I think it does need to be actual dep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said we can remove the prost-types dep, we don't need both: e6ff946

@@ -42,7 +42,7 @@ use datafusion::logical_expr::expr::{
};
use datafusion::logical_expr::{expr, Between, JoinConstraint, LogicalPlan, Operator};
use datafusion::prelude::Expr;
use prost_types::Any as ProtoAny;
use pbjson_types::Any as ProtoAny;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHy do we need to change to pbjson_types here? It seems like the json support is only for testing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this confuses me a bit as well - but the way I understand it, adding the "serde" feature to "substrait-rs" seems to change the types it expects from the original prost types into the pbjson types (https://github.com/substrait-io/substrait-rs/blob/d11b22c743df9b02b7e5f4c079091ed133982431/build.rs#L262), probably to make them all serde-serializable (https://docs.rs/pbjson-types/latest/pbjson_types/#).

Given we use the Any proto type in non-test code, I think this does need to change, even though we don't necessarily need the serializability here.. This is also why we need the pbjson-types dep as normal, not dev, dep.

I'd also prefer to not change this, but I couldn't find another way if we want to have the JSON proto plans :/

// under the License.

#[cfg(test)]
mod tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the idea of having checked in substrait plans (as json protobuf) is actually likely a pretty good way to build up a library of test cases "from the wild" (not round tripped)

This is quite cool @Blizzara

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blizzara

@waynexia
Copy link
Member

though curious if you have any thoughts on how to make it simpler - given DF doesn't yet produce those compound names, I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone

As a big scenario for our substrait support is to inter-cooperation with other systems, it might be a good idea to have some "integration" test, which uses another project like isthmus to produce plan and then consume them in datafusion side.

But this is not trivial to achieve, just an idea for now.

@Blizzara
Copy link
Contributor Author

As a big scenario for our substrait support is to inter-cooperation with other systems, it might be a good idea to have some "integration" test, which uses another project like isthmus to produce plan and then consume them in datafusion side.

But this is not trivial to achieve, just an idea for now.

Yeah, I have that kind of integration testing internally, kind of, as part of our use of datafusion-substrait (see #10412 (comment)). I'll see if I can at some point contribute something like it here as well :)

@waynexia
Copy link
Member

I've skimmed this patch and it looks good to me 👍 I'll take a detailed look into the test part tomorrow.

@richtia
Copy link

richtia commented May 28, 2024

though curious if you have any thoughts on how to make it simpler - given DF doesn't yet produce those compound names, I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone

As a big scenario for our substrait support is to inter-cooperation with other systems, it might be a good idea to have some "integration" test, which uses another project like isthmus to produce plan and then consume them in datafusion side.

But this is not trivial to achieve, just an idea for now.

FYI...The substrait project also has a repo that aims to do integration testing between different substrait producers/consumers. https://github.com/substrait-io/consumer-testing

@alamb
Copy link
Contributor

alamb commented May 29, 2024

though curious if you have any thoughts on how to make it simpler - given DF doesn't yet produce those compound names, I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone

As a big scenario for our substrait support is to inter-cooperation with other systems, it might be a good idea to have some "integration" test, which uses another project like isthmus to produce plan and then consume them in datafusion side.
But this is not trivial to achieve, just an idea for now.

FYI...The substrait project also has a repo that aims to do integration testing between different substrait producers/consumers. https://github.com/substrait-io/consumer-testing

I filed #10710 to track. Thank you @richtia

@alamb
Copy link
Contributor

alamb commented May 29, 2024

Thanks again everyone!

@alamb alamb merged commit 80a6b65 into apache:main May 29, 2024
25 checks passed
@Blizzara Blizzara deleted the avo/substrait-function-signature-compound-names branch May 29, 2024 14:30
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…pache#10653)

* Support consuming Substrait with compound signature function names

Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes.
We don't necessarily need that information while consuming the plans, but we need to support those compound names.

* Add a test for using "not:bool"

* clippy fixes

* Add license to new file

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* remove prost-types dep as it's replaced by pbjson-types

---------

Co-authored-by: Andrew Lamb <[email protected]>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substrait integration doesn't recognize typed functions
4 participants