Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(substrait): set ProjectRel output_mapping in producer #12495

Merged

Conversation

vbarua
Copy link
Contributor

@vbarua vbarua commented Sep 16, 2024

Which issue does this PR close?

Continues #12347

This PR sets the output_mapping field on ProjectRels for plans produced by DataFusion.

Rationale for this change

Given a plan

Projection: data.b, data.a + data.a, data.a
  TableScan: data projection=[a, b]",

When creating the Substrait ProjectRel, DataFusion will now set the output mapping to [2,3,4] so that the Substrait ProjectRel will match the DataFusion behaviour of only outputting expressions.

By default, if the emit_kind field is not set with an output_mapping, a Substrait ProjectRel emits all input columns followed by all expressions. This does not match the DataFusion behaviour.

This change only includes producer changes in order to ease the migration for users. Because DataFusion ignores the emit_kind field when consuming plans, DataFusion will be able to consume plans that set the output mapping correctly and plans that don't.

Once the consumer is updated to respect emit_kind, plans that don't set the output_mapping will not work as before. By shipping the producer change by itself, we allow users who serialize plans between versions the opportunity to update all of their plans before making a consumer change that would break them.

Are these changes tested?

Yes

Are there any user-facing changes?

Substrait plans with ProjectRels will now set the output_mapping field.

advanced_extension: None,
})
}
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially a useful optimization, however it becomes a bit more complicated with the introduction of the output_mapping because you a need to modify it along with the expressions. I've opted to simplify this for now to favour simplicity and correctness.

As well, I think this is better handled in when consuming plans and/or by the optimizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the datafusion optimizer already handles projection pushdown quite well -- so keeping the substrait producer simpler makes the most sense to me

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @vbarua -- these PRs are easy to review as they clearly explain why they are changing something, are well documented, and well tested 🏆

advanced_extension: None,
})
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the datafusion optimizer already handles projection pushdown quite well -- so keeping the substrait producer simpler makes the most sense to me

@alamb
Copy link
Contributor

alamb commented Sep 17, 2024

FYI @Blizzara

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

Thanks again @vbarua

@alamb alamb merged commit ec10c04 into apache:main Sep 18, 2024
24 checks passed
};

if let Some(rel::RelType::Project(p1)) = root_rel.rel_type.as_ref() {
// The WindowAggr outputs 4 columns, the Projection has 4 columns
Copy link
Contributor

@Blizzara Blizzara Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both are 3 columns?

Or well the window outputs 4 but the project outputs only 3.

I guess that's what you're saying here anyways 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that is a typo. It should have been

The WindowAggr outputs 4 columns, the Projection has 3 columns
                                                    ^^^

which is what the assert below is actually (correctly) checking.

I'll make a note to myself to update this when I make the consumer changes.

let mut window_exprs = vec![];

// create a field reference for each input field
let mut expressions = (0..window.input.schema().fields().len())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could not add these ref expressions, and then use a direct emit (or remap starting from 0) instead, would that not be the same? That said it'd break roundtrip until you've gotten the consumer part done as well 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants