Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include input fields as output for Substrait consumer #12225

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Lordworms
Copy link
Contributor

@Lordworms Lordworms commented Aug 29, 2024

Which issue does this PR close?

Closes #12204

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Lordworms Lordworms marked this pull request as ready for review August 29, 2024 06:26
@alamb
Copy link
Contributor

alamb commented Sep 1, 2024

FYI @Blizzara

@Blizzara
Copy link
Contributor

Blizzara commented Sep 2, 2024

I don't feel this is right - unless I misunderstood something. This tries to fix the problem at the root level of the plan, so that we'd produce the columns the plan asked for, and it works in this test case where there is only a single project in the plan, but I don't think it would work for any more complicated plan.

I think the proper fix would be around

let mut exprs: Vec<Expr> = vec![];
, to add all of the output fields of the project's input rel into the exprs list (and also names I guess) before adding the new expressions. Or something along those lines

@Lordworms
Copy link
Contributor Author

Lordworms commented Sep 3, 2024

I don't feel this is right - unless I misunderstood something. This tries to fix the problem at the root level of the plan, so that we'd produce the columns the plan asked for, and it works in this test case where there is only a single project in the plan, but I don't think it would work for any more complicated plan.

I think the proper fix would be around

let mut exprs: Vec<Expr> = vec![];

, to add all of the output fields of the project's input rel into the exprs list (and also names I guess) before adding the new expressions. Or something along those lines

Thanks for the suggrestions! I'll take a try and adding more test cases. In the begining I was going to directly change the initisal expressions in from_substrait_rel but feel like that would trigger lots of other test coverages which relies on this function. I'll pass the names to this function

@vbarua
Copy link
Contributor

vbarua commented Sep 4, 2024

substrait-java has a similar issue with mapping Substrait Projects, which include the input fields in the output, with Calcite Projects which do not. As @Blizzara suggests, a good fix for this is to include all of the input fields as expression in the Project. The following substrait-java code is what handles this https://github.com/substrait-io/substrait-java/blob/c8c31ecd3da047a4de0c33b21f3e960df628bda0/isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java#L155-L171

@Lordworms
Copy link
Contributor Author

substrait-java has a similar issue with mapping Substrait Projects, which include the input fields in the output, with Calcite Projects which do not. As @Blizzara suggests, a good fix for this is to include all of the input fields as expression in the Project. The following substrait-java code is what handles this https://github.com/substrait-io/substrait-java/blob/c8c31ecd3da047a4de0c33b21f3e960df628bda0/isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java#L155-L171

I think that's what I did in the updated code, just some other points I mentioned in the review

@@ -214,6 +214,9 @@ pub async fn from_substrait_plan(
Ok(from_substrait_rel(ctx, rel, &extensions).await?)
},
plan_rel::RelType::Root(root) => {
if !root.names.is_empty() {
extensions = extensions.with_projection_names(root.names.clone());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of storing the output names in the extension like this? They shouldn't be needed anywhere else but here, and it looks like the code below already applies them to the generated plan.

@@ -439,6 +458,27 @@ pub async fn from_substrait_rel(
}
names.insert(new_name);
}
let schema = input.schema();
if let (Some(extensions_names), true) =
(extensions.names.as_ref(), p.common.is_some())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the final output names here doesn't really make sense. Those are intended to apply names to the results of the Substrait plan.

Given a Substrait plan like:

Project[a + b]
     NamedScan[a, b, c, d, e]

What the DataFusion Project needs to include is all of the fields of the input relation like:

DataFusionProject[a, b, c, d, e, a + b]
    DataFusionTable[a, b, c, d, e]

@@ -412,6 +428,10 @@ pub async fn from_substrait_rel(
);
let mut names: HashSet<String> = HashSet::new();
let mut exprs: Vec<Expr> = vec![];
input.schema().iter().for_each(|(qualifier, field)| {
exprs.push(col(Column::from((qualifier, field))))
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like the right thing - but then to get tests passing again I think we'll also need to add support for the remap https://substrait.io/relations/basics/#emit-output-ordering, both here and in producer...

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

I am trying to clean up the review queue, so marking this PR as draft as the CI tests are failing. Let me know if that wasn't right

@alamb alamb marked this pull request as draft September 6, 2024 20:45
@vbarua
Copy link
Contributor

vbarua commented Sep 6, 2024

I have some work in progress around remaps for #12347 which I suspect will overlap with this. @Lordworms I'd be happy to pull your changes into my work if that makes sense to you.

@Lordworms
Copy link
Contributor Author

I have some work in progress around remaps for #12347 which I suspect will overlap with this. @Lordworms I'd be happy to pull your changes into my work if that makes sense to you.

Sure, that would be fine, Thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Projects from Substrait do not include input fields as output fields
4 participants