[SPARK-39748][SQL][SS][FOLLOWUP] Fix a bug on column stat in LogicalRDD on mismatching exprIDs #37187
cc. @cloud-fan @viirya
val rewrittenOriginLogicalPlan = originLogicalPlan.map { plan =>
  val projectList = output.map { attr =>
    Alias(attr, attr.name)(exprId = rewrite.getOrElse(attr, attr).exprId)
Since rewrite is a map covering all of output, we can use rewrite(attr) directly instead of getOrElse.
This is mostly defensive programming: if a bug makes the two sets of columns go out of sync, we let them stay out of sync instead of failing the query, since the impact is not that serious, e.g. column stats won't be available. (Vendors/3rd parties may still want to leverage it for major functionality, though.)
On the other hand, I'm also in favor of failing fast: set the precondition that "the two sets of columns must be in sync" and assert it on initialization of the class. After that we can safely assume the precondition holds, and it'd be safe to just use rewrite(attr) here.
I'm fine either way. WDYT? cc. @cloud-fan as well.
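For readers following the trade-off, here is a minimal, Spark-free sketch of the two lookup styles being discussed. `Attr`, `lenientExprId` and `strictExprIds` are hypothetical names made up for illustration; they are not Catalyst classes and this is not the code in the PR.

```scala
object RewriteLookupSketch {
  // Hypothetical stand-in for a Catalyst attribute: a name plus an expression ID.
  final case class Attr(name: String, exprId: Long)

  // Defensive style: if the map has no entry, fall back to the attribute itself,
  // so a missing mapping silently keeps the old expression ID.
  def lenientExprId(rewrite: Map[Attr, Attr], attr: Attr): Long =
    rewrite.getOrElse(attr, attr).exprId

  // Fail-fast style: check the precondition once, then use plain apply.
  def strictExprIds(output: Seq[Attr], rewrite: Map[Attr, Attr]): Seq[Long] = {
    require(output.forall(rewrite.contains),
      s"rewrite must cover every output column; output: $output")
    output.map(attr => rewrite(attr).exprId)
  }

  def main(args: Array[String]): Unit = {
    val a = Attr("a", 1L)
    val b = Attr("b", 2L)
    val rewrite = Map(a -> Attr("a", 11L)) // deliberately missing an entry for b

    println(lenientExprId(rewrite, a)) // 11: mapping found
    println(lenientExprId(rewrite, b)) // 2: silent fallback to the old ID
    println(scala.util.Try(strictExprIds(Seq(a, b), rewrite)).isFailure) // true
  }
}
```

The lenient version hides an out-of-sync map behind a fallback, while the strict version surfaces it as soon as the precondition is checked.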
Okay for me. Just a nit comment.
No, not really. My bad, you're right. It only looks into the output of LogicalRDD. (And this code wouldn't work anyway if the two sets of columns were out of sync.)
Let me reflect the change.
}.asInstanceOf[SortOrder])

val rewrittenOriginLogicalPlan = originLogicalPlan.map { plan =>
  assert(output == plan.output, "The output columns are expected to the same for output " +
Actually, I first added this assertion on initialization as a precondition, and then realized that canonicalization breaks it (output is canonicalized, but originLogicalPlan is not a target of canonicalization).
I wouldn't expect Spark to call newInstance on a canonicalized node, but please correct me if I'm mistaken.
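To make the point concrete, a rough sketch of why the check sits inside newInstance rather than in the constructor. Everything here is a simplified assumption: `Attr`, `Node` and `canonicalize` are hypothetical, and renumbering expression IDs by position only approximates what Catalyst canonicalization does.

```scala
object CanonicalizedPreconditionSketch {
  // Hypothetical stand-in for a Catalyst attribute.
  final case class Attr(name: String, exprId: Long)

  // Assumed, simplified canonicalization: drop names, renumber IDs by position.
  def canonicalize(output: Seq[Attr]): Seq[Attr] =
    output.indices.map(i => Attr("none", i.toLong))

  // `originOutput` plays the role of originLogicalPlan's output.
  final case class Node(output: Seq[Attr], originOutput: Seq[Attr]) {
    // Only `output` is canonicalized; the origin side is left untouched,
    // so an assertion in the constructor would fire on this copy.
    def canonicalized: Node = copy(output = canonicalize(output))

    // The precondition is checked only where it is actually needed.
    def newInstance(freshId: Long => Long): Node = {
      assert(output == originOutput, "output and origin output are out of sync")
      val fresh = output.map(a => a.copy(exprId = freshId(a.exprId)))
      Node(fresh, fresh)
    }
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq(Attr("a", 7L))
    val node = Node(cols, cols)
    println(node.newInstance(_ + 100).output)  // fresh IDs, still in sync
    println(node.canonicalized)                // constructing this copy is fine
    println(scala.util.Try(node.canonicalized.newInstance(identity)).isFailure)
  }
}
```

Under these assumptions, building the canonicalized copy succeeds and only calling newInstance on it trips the assertion, which matches the expectation that Spark never does that.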
I think Spark doesn't call newInstance on a canonicalized node. cc @cloud-fan
Yeah, I can't think of any use case that would take a canonicalized node as the starting point for anything.
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR fixes a bug in #37161 (described in the section below) by making sure the output columns of LogicalRDD are always the same as the output columns of originLogicalPlan in LogicalRDD, which is needed to inherit the column stats.
Why are the changes needed?
Column stats in originLogicalPlan refer to the columns of originLogicalPlan, which can differ from the columns in the output of LogicalRDD in terms of expression ID.
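A toy illustration of the mismatch, assuming stats are looked up by expression ID rather than by name. `Attr`, `ColStat`, `lookup` and `realign` are hypothetical helpers; the diff quoted in the review aliases each attribute in a projection instead of re-keying a map, so treat this as a simplification, not the PR's implementation.

```scala
object ColumnStatSketch {
  // Hypothetical stand-ins for a Catalyst attribute and a column statistic.
  final case class Attr(name: String, exprId: Long)
  final case class ColStat(distinctCount: Long)

  // Assumption: stats are keyed by expression ID, so the name alone is not enough.
  def lookup(stats: Map[Long, ColStat], attr: Attr): Option[ColStat] =
    stats.get(attr.exprId)

  // Re-key the stats from the origin plan's attributes to the output attributes,
  // pairing the two column lists by position.
  def realign(
      originOutput: Seq[Attr],
      output: Seq[Attr],
      stats: Map[Long, ColStat]): Map[Long, ColStat] = {
    require(originOutput.length == output.length, "column lists must have the same length")
    originOutput.zip(output).flatMap { case (from, to) =>
      stats.get(from.exprId).map(stat => to.exprId -> stat)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val originOutput = Seq(Attr("id", 1L))
    val output = Seq(Attr("id", 42L)) // same column, different expression ID
    val stats = Map(1L -> ColStat(distinctCount = 10L))

    println(lookup(stats, output.head))                                // None
    println(lookup(realign(originOutput, output, stats), output.head)) // Some(ColStat(10))
  }
}
```

The first lookup misses even though the column name matches; once both sides agree on expression IDs the stat is found.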
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit test.