-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate ordering equivalence for expressions (rather than just columns) #8281
Conversation
* Discover ordering of complex expressions in group by and window partition by * Remove unnecessary tests * Update comments * Minor changes * Better projection support complex expression support * Fix failing test * Simplifications * Simplifications * Add is end flag * Simplifications * Simplifications * Simplifications * Minor changes * Minor changes * Minor changes * All tests pass * Change implementation of find_longest_permutation * Minor changes * Minor changes * Remove projection section * Remove projection implementation * Fix linter errors * Remove projection sections * Minor changes * Add docstring comments * Add comments * Minor changes * Minor changes * Add comments * simplifications
I will do another deep dive on this tomorrow |
Thank you @mustafasrepo -- I plan to review this carefully tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a third pass today and LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @mustafasrepo -- can you explain how non monotonic functions are handled (like extract(month from ts)
) where the expressions don't preserve the ordering of their inputs? I didn't see any tests for such functions or code that handles them
Sorry I may have just missed it / don't understand the code
@@ -357,15 +357,19 @@ mod tests { | |||
let physical_plan = | |||
sort_preserving_merge_exec(vec![sort_expr("a", &schema)], sort); | |||
|
|||
let expected_input = ["SortPreservingMergeExec: [a@0 ASC NULLS LAST]", | |||
let expected_input = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did these plans actually change? Or is this just whitespace changes?
If it is just whitespace changes I would really appreciate breaking such changes out into their own PRs as they are much faster / easier to review and merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just whitespace changes. Agree with your suggestion in general, we will try to improve our self discipline on that
@@ -3787,7 +3787,7 @@ pub(crate) mod tests { | |||
fn repartition_transitively_past_sort_with_projection_and_filter() -> Result<()> { | |||
let schema = schema(); | |||
let sort_key = vec![PhysicalSortExpr { | |||
expr: col("c", &schema).unwrap(), | |||
expr: col("a", &schema).unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this test changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the plan, there is a filter enforcing c = 0
-- so having a sort on c
is a no-op logically. After recent improvements we are able to recognize things like that and remove sorts completely, so the test became moot in its old form.
query P | ||
SELECT date_bin('15 minutes', ts) as time_chunks | ||
FROM csv_with_timestamps | ||
GROUP BY (date_bin('15 minutes', ts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused about the seeming extra (
and )
here and in the above query:
Is there any reason it is not:
GROUP BY (date_bin('15 minutes', ts)) | |
GROUP BY date_bin('15 minutes', ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason, it was unnecessary -- now removed
---- | ||
2018-12-13T12:00:00 | ||
2018-11-13T17:00:00 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a negative test that can not work in streaming mode with an expression that doesn't preserve the order?
perhaps something like
SELECT extract(month from ts) as months,
FROM csv_with_timestamps
GROUP BY extract(month from ts)
ORDER BY time_chunks DESC
LIMIT 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I added a test that shows the query works when the function is monotonic but not otherwise for unbounded sources. I also added a test that shows the query works regardless of the function's monotonicity for bounded sources.
Thanks for the review @alamb. I answered your questions and added a couple tests to address your suggestions (verifying behavior difference for monotonic/nonmonotonic functions). Regarding your question, complex expression orderings are calculated via the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ozankabak -- I didn't review the logic changes carefully but I don't think I need to either given you have. The tests added to datafusion/sqllogictest/test_files/groupby.slt
look great to me and convince me this works correctly
For some reason sqllogictests are failing in CI (they pass locally for me). Looking at the logs, it seems like the expected/actual are the same, so I can't figure out why this is the case. I'll let @mustafasrepo figure it out and fix tomorrow and we can merge then. In the meantime if there is more suggestions for improvements from the community we will incorporate them as well |
Which issue does this PR close?
Improves the situation on #8064
Rationale for this change
Currently we cannot determine whether ordering requirement of the complex expressions are satisfied.
Consider the case where requirement is
[a+b ASC]
, and we know that existing orderings are[a ASC], [b ASC]
. It should be possible to understand that given[a ASC] and [b ASC]
, requirement:[a+b ASC]
is satisfied (Please note that we already have mechanism to make deduction. However, it is not considered in requirement satisfaction analysis. Currently we search requirement among existing orderings, this limits our ability during analysis).What changes are included in this PR?
This PR re-implements the
find_longest_permutation
andordering_satisfy_requirement
so that ordering of the complex(composed) expressions are calculated correctly, and considered during analysis. Most of the changes comes from the tests, and test related changes.Are these changes tested?
Yes.
Are there any user-facing changes?