-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join #28556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #122750 has finished for PR 28556 at commit
|
|
retest this please |
|
Test build #122752 has finished for PR 28556 at commit
|
|
cc @maropu |
| case Project(projectList, child) | ||
| if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) => | ||
| getAliasSubMap(projectList) | ||
| val exprsToPrune = projectList ++ child.expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: exprsToPrune -> exprCandidatesToPrune?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
| }) | ||
| }).transformExpressions { | ||
| case f: ExtractValue if nestedFieldToAlias.contains(f) => | ||
| nestedFieldToAlias(f).toAttribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change needed only for supporting joins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, for example RepartitionByExpression also needs this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I got it. It seems this change is related to https://github.com/apache/spark/pull/28556/files#diff-43334bab9616cc53e8797b9afa9fc7aaL207-L215
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the operators which have expressions should need this to replace ExtractValue and nested column aliases.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Show resolved
Hide resolved
| val newGenerate = g.copy(generator = newGenerator) | ||
|
|
||
| NestedColumnAliasing.replaceChildrenWithAliases(newGenerate, attrToAliases) | ||
| NestedColumnAliasing.replaceChildrenWithAliases(g, nestedFieldToAlias, attrToAliases) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we need to update the method name of replaceChildrenWithAliases. We don't need Children in the name, anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the method name.
|
Test build #122916 has finished for PR 28556 at commit
|
|
retest this please |
|
@maropu I addressed your comments. Could you help take another look? Thanks. |
|
Test build #122917 has finished for PR 28556 at commit
|
| test("Pushing a single nested field projection - negative") { | ||
| val ops = Seq( | ||
| (input: LogicalPlan) => input.distribute('name)(1), | ||
| (input: LogicalPlan) => input.distribute($"name.middle")(1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, looks nice. This PR could support this case.
| .analyze | ||
| comparePlans(optimized1, expected1) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary line break.
| checkAnswer(query3, Row("abc") :: Row(null) :: Nil) | ||
| } | ||
|
|
||
| testSchemaPruning("select one deep nested complex field after outer join") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the tests.
| "struct<contactId:int>") | ||
| checkAnswer(query1, Row("X.") :: Row("Y.") :: Nil) | ||
|
|
||
| val query2 = sql("select contacts.name.middle from contacts, departments where " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think its better to use uppercases for SQL keywords where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems all tests in this test suite are using lowercases. Changing all tests seems too bothering... :)
|
I left some minor comments though, it looks okay. cc: @dongjoon-hyun @dbtsai |
|
Test build #122922 has finished for PR 28556 at commit
|
|
Test build #122966 has finished for PR 28556 at commit
|
|
retest this please |
|
Test build #122977 has finished for PR 28556 at commit
|
|
retest this please |
|
Test build #123168 has finished for PR 28556 at commit
|
|
ping @cloud-fan @dongjoon-hyun |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Show resolved
Hide resolved
|
Test build #123785 has finished for PR 28556 at commit
|
| val exprCandidatesToPrune = projectList ++ child.expressions | ||
| getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq) | ||
|
|
||
| case plan if SQLConf.get.nestedSchemaPruningEnabled && canPruneOn(plan) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big deal but I would rename plan to p to avoid shadowing the plan argument. At least my IDE complains on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it in other PR. Thanks.
| if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) => | ||
| getAliasSubMap(projectList) | ||
| val exprCandidatesToPrune = projectList ++ child.expressions | ||
| getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya, just to clarify, you added producedAttributes here just to be safe but not related to the current changes (?). Seems Join and RepartitionByExpression have an empty producedAttributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, if it's going to output, it shouldn't be pruned anyway.
|
Merged to master. |
What changes were proposed in this pull request?
Currently we only push nested column pruning through a few operators such as LIMIT, SAMPLE, etc. This patch extends the feature to other operators including RepartitionByExpression, Join.
Why are the changes needed?
Currently nested column pruning only applied on a few operators. It limits the benefit of nested column pruning. Extending nested column pruning coverage to make this feature more generally applied through different queries.
Does this PR introduce any user-facing change?
Yes. More SQL operators are covered by nested column pruning.
How was this patch tested?
Added unit test, end-to-end tests.