SanityCheckPlan should compare UnionExec inputs to requirements for output (parent). #12414
Thanks @wiedld for the detailed description of the problem. Once it's ready, I will review it thoroughly.
When I briefly look at the solution, I actually think we shouldn't patch the sanity checker to relax it. A more accurate solution seems to be your alternate proposal:
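You can actually add constants, as there is a flag, "across_partitions," that indicates whether the value is constant across all partitions or only in its corresponding partition. However, after adding constants, I am not sure if …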
I agree with @berkaysynnada that we should fix the calculation rather than relax the sanity checker in DataFusion. We (already) patch DataFusion to skip the sanity check for certain plan nodes so we aren't blocked downstream. I think we should focus on the "right" fix that allows the sanity checker to run but also correctly recognizes that the plan is valid.
…ictions based upon the Union's parent vs the Union's input" This reverts commit a50edc3.
caveat: this has an unintended side effect, as the EnforceSorting removes the sort_expr from one input/side of the UnionExec (where it's not constant)
04)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,a@2 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false]
04)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false]
05)--------ProjectionExec: expr=[b@1 as b, c@2 as c, a@0 as a, NULL as a0, d@3 as d]
06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, c, d], output_ordering=[c@2 ASC NULLS LAST], has_header=true
07)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,a0@3 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false]
you can actually add constants, as there is a flag, "across_partitions," that indicates whether the value is constant across all partitions or only in its corresponding partition
Made the change per suggestion (demonstration only, not final commits), and I'm not sure this is the proper fix. If I add the constants to the union's equivalence properties, there are other ramifications because:
- the EquivalenceProperties::output_ordering removes the constants
- the EquivalenceProperties::normalized_oeq_class removes the constants
In the example above, the sort orders [a@2 ASC NULLS LAST] and [a0@3 ASC NULLS LAST] are removed on non-constant projections. The EnforceSorting optimization adds (and pushes down) the SortExecs, but the change itself really occurs based upon the EquivalenceProperties's definition of a (non-constant) sort order. Since the UnionExec lists certain columns as constants, they are removed from the sort order.
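To make the side effect concrete, here is a minimal sketch of the "normalize an ordering against the constants" step. It assumes a much simplified model in which columns are plain strings and sort direction is ignored; `strip_constants` is a hypothetical helper for illustration, not a DataFusion API.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): drop every column that is
/// marked as constant from a sort ordering. Sort direction and null
/// placement are omitted to keep the sketch small.
fn strip_constants<'a>(ordering: &[&'a str], constants: &HashSet<&'a str>) -> Vec<&'a str> {
    ordering
        .iter()
        .copied()
        // Keep only the columns that are NOT listed as constants.
        .filter(|col| !constants.contains(col))
        .collect()
}
```

If `a` is listed as a constant on the union as a whole, the ordering `[d, c, a, b]` shrinks to `[d, c, b]` for every input, including the input where `a` is a real, non-constant column. That mirrors the changed `SortExec` line in the plan above.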
I started hacking around this in the EnforceSorting, but it feels like the suggested change (adding to constants) may actually be changing the definition of what the constants are? 🤔 Am I heading in the right direction here?
I think I misled you unintentionally. The part we need to focus on is datafusion/datafusion/physical-expr/src/equivalence/properties.rs, lines 1559 to 1578 in 3ece7a7.
When I think about what we need to obtain as the finest ordering from the union of two children:
Then the union should have:
To get this result, we need to keep track of the left and right constants of the children, which are not allowed to be placed in the union's constants.
Going back to my example above, these child-specific constants can be treated as:
If we reapply the ordering calculation part I shared above on these updated child properties, we should get what we need:
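One possible way to read this idea is sketched below. It is not the code in properties.rs; `ChildProps` and `union_ordering` are hypothetical names, columns are plain strings, and sort direction is ignored. The sketch walks both children's orderings together and lets a column that is constant in one child be taken from the other child's ordering, while the child-specific constants never become constants of the union.

```rust
use std::collections::HashSet;

/// Hypothetical summary of one union input: its sort columns (direction and
/// null placement omitted) and the columns that are constant within it.
struct ChildProps<'a> {
    ordering: Vec<&'a str>,
    constants: HashSet<&'a str>,
}

/// Longest ordering that both children satisfy. A column that is constant in
/// one child may be taken from the other child's ordering, because a constant
/// column is trivially sorted wherever it appears.
fn union_ordering<'a>(left: &ChildProps<'a>, right: &ChildProps<'a>) -> Vec<&'a str> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    loop {
        let l = left.ordering.get(i).copied();
        let r = right.ordering.get(j).copied();
        match (l, r) {
            // Both children sort on the same column next.
            (Some(a), Some(b)) if a == b => {
                out.push(a);
                i += 1;
                j += 1;
            }
            // The left column is constant on the right side.
            (Some(a), _) if right.constants.contains(a) => {
                out.push(a);
                i += 1;
            }
            // The right column is constant on the left side.
            (_, Some(b)) if left.constants.contains(b) => {
                out.push(b);
                j += 1;
            }
            // No further column can be shared by both children.
            _ => break,
        }
    }
    out
}
```

For example, a left child ordered on `[a, b]` with no constants and a right child ordered on `[a]` with `b` constant yield `[a, b]` for the union, while `b` stays out of the union's own constants.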
I filed #12446 to track this issue.
Which issue does this PR close?
Resolves #12446
We have failures caused by SanityCheckPlan not considering the constants in the UnionExec input orderings. We made an exact reproducer in the first commit's test case.
Rationale for this change
Given the following scenario:
The SanityCheckPlan was failing, because the UnionExec has the following orderings vs constants:
which means that no single ordering (minus constants) could fulfill the [proj1 sorted, proj2 sorted] sort requirement, because the UnionExec has its input orderings but not its input constants.
What changes are included in this PR?
I elected to perform the comparisons (during the SanityCheckPlan) between the UnionExec's parent and children, skipping the UnionExec itself.
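A rough sketch of that comparison follows, under simplifying assumptions: columns are plain strings, sort direction is ignored, and each union input is modeled as an (ordering, constants) pair. `input_satisfies` and `union_inputs_satisfy` are hypothetical names, not the actual SanityCheckPlan code.

```rust
use std::collections::HashSet;

/// Hypothetical check: does one union input satisfy the parent's required
/// ordering, given the columns that are constant within that input?
fn input_satisfies(required: &[&str], ordering: &[&str], constants: &HashSet<&str>) -> bool {
    let mut next = 0; // position of the next unmatched column in `ordering`
    for col in required {
        // A constant column is trivially sorted, so it can be skipped.
        if constants.contains(*col) {
            continue;
        }
        match ordering.get(next) {
            Some(o) if *o == *col => next += 1,
            _ => return false,
        }
    }
    true
}

/// Compare the parent's requirement with every union input directly,
/// skipping the union's own (constant-less) properties.
fn union_inputs_satisfy(required: &[&str], inputs: &[(Vec<&str>, HashSet<&str>)]) -> bool {
    inputs
        .iter()
        .all(|(ordering, constants)| input_satisfies(required, ordering, constants))
}
```

With a parent requirement of `[d, c, a, a0, b]`, an input ordered on `[d, c, a, b]` with `a0` constant passes, and so does an input ordered on `[d, c, a0, b]` with `a` constant. Without the per-input constants, neither input would pass. The constants chosen here are illustrative and only loosely modeled on the plan in this thread.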
Alternatively, I could have elected to:
Are these changes tested?
Yes.
Are there any user-facing changes?
No.