-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32163][SQL] Nested pruning should work even with cosmetic variations #28988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @maropu @dongjoon-hyun @frankyin-factual |
|
Great! Thanks. I gonna apply the same patch to my branch. |
|
You can rebase after this gets merged. |
|
Does Jenkins not work again? |
|
ok to test |
| val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]] | ||
| .filter(!_.references.subsetOf(exclusiveAttrSet)) | ||
| .groupBy(_.references.head) | ||
| .groupBy(_.references.head.canonicalized.asInstanceOf[Attribute]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, @viirya @frankyin-factual !
| testSchemaPruning("SPARK-32163: nested pruning should work even with cosmetic variations") { | ||
| withTempView("contact_alias") { | ||
| sql("select * from contacts") | ||
| .repartition(100, col("name.first"), col("name.last")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue cannot happen in branch-3.0?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Lines 80 to 85 in fc2660c
| private def canProjectPushThrough(plan: LogicalPlan) = plan match { | |
| case _: GlobalLimit => true | |
| case _: LocalLimit => true | |
| case _: Repartition => true | |
| case _: Sample => true | |
| case _ => false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is only for master branch. However this issue can happen in branch-3.0 too. I added another new test here, which is for branch-3.0.
But when we backport this to branch-3.0, we need to remove first test case as it will fail on checkScan(query, "struct<name:struct<first:string,last:string>>"), because branch-3.0 doesn't prune for repartition by expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, thanks for adding the test.
| if (nestedFieldToAlias.nonEmpty && | ||
| nestedFieldToAlias | ||
| .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) } | ||
| dedupNestedFields.map(_.canonicalized.asInstanceOf[ExtractValue]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need the cast here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is related to the query failure in the test? Looks it is just an optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, we cannot avoid the generating duplicated aliases (name#1.first AS _gen_alias_52#52 and name#1.first AS _gen_alias_53#53) below? Is this technically difficult? (This is not related to this PR and just a question)
scala> sql("select name.first from contact_alias").explain()
== Physical Plan ==
*(2) Project [_gen_alias_52#52 AS first#50]
+- Exchange hashpartitioning(_gen_alias_53#53, _gen_alias_54#54, 100), false, [id=#46]
+- *(1) Project [name#1.first AS _gen_alias_52#52, name#1.first AS _gen_alias_53#53, name#1.last AS _gen_alias_54#54]
+- FileScan parquet [name#1,p#8] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/contacts], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<first:string,last:string>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to cosmetic variations, Extractors with different qualifiers, for example, will cause incorrect total number of fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the duplicated aliases, we can avoid it. We can work on canonicalized extractors when generating aliases.
But we also need to convert coming extractors to canonicalized versions when we look up into the alias map.
Currently the code looks clear. And it seems not a big deal, and I think it is rare case that there are multiple extractors with cosmetic difference. So currently I don't try to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the duplicated aliases, we can avoid it. We can work on canonicalized extractors when generating aliases.
But we also need to convert coming extractors to canonicalized versions when we look up into the alias map.
Currently the code looks clear. And it seems not a big deal, and I think it is rare case that there are multiple extractors with cosmetic difference. So currently I don't try to do that.
Thanks for the explanation. Looks okay.
| val query2 = sql("select friends.middle, col from contact_alias") | ||
| checkScan(query2, "struct<friends:array<struct<first:string,middle:string>>>") | ||
| checkAnswer(query2, Row(Array("Z."), "Susan") :: Nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu This test is for branch-3.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the test case.
|
Looks okay and anyone could check this? @dongjoon-hyun @dbtsai |
|
Test build #124928 has finished for PR 28988 at commit
|
|
retest this please |
|
FYI: @zhengruifeng @ScrapCodes (Not sure which one is a release manager though) I think this fix's better to be included in the v3.0.1 release. |
|
Test build #124991 has finished for PR 28988 at commit
|
|
retest this please... |
This comment has been minimized.
This comment has been minimized.
|
retest this please |
This comment has been minimized.
This comment has been minimized.
|
retest this please |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
The Jenkins looks unstable. |
|
Yeah, Shane is working hard on this issue now, so we need to wait a little until it stabilizes. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
retest this please |
|
Test build #125166 has finished for PR 28988 at commit
|
|
retest this please |
|
Test build #125194 has finished for PR 28988 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you so much, @viirya , @maropu , @frankyin-factual , @HyukjinKwon .
Merged to master.
The last commit is a only-indentation fix.
|
Thanks all. Sure, let me create a backporting PR for branch-3.0. |
|
Test build #125233 has finished for PR 28988 at commit
|
… variations ### What changes were proposed in this pull request? This patch proposes to deal with cosmetic variations when processing nested column extractors in `NestedColumnAliasing`. Currently if cosmetic variations are in the nested column extractors, the query is not optimized. This backports #28988 to branch-3.0. ### Why are the changes needed? If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug in nested column pruning. ### How was this patch tested? Unit test. Closes #29027 from viirya/SPARK-32163-3.0. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This patch proposes to deal with cosmetic variations when processing nested column extractors in
NestedColumnAliasing. Currently if cosmetic variations are in the nested column extractors, the query is not optimized.Why are the changes needed?
If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well.
For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning.
Does this PR introduce any user-facing change?
Yes, fixing a bug in nested column pruning.
How was this patch tested?
Unit test.