[SPARK-32163][SQL] Nested pruning should work even with cosmetic variations #28988

viirya · 2020-07-03T02:19:11Z

What changes were proposed in this pull request?

This patch makes NestedColumnAliasing tolerate cosmetic variations when it processes nested column extractors. Currently, if the nested column extractors contain cosmetic variations, the query is not optimized.

Why are the changes needed?

If the expressions that extract nested fields have cosmetic variations, such as a difference in qualifiers, nested column pruning currently does not work well.

For example, when a query refers to two attributes that are semantically the same, their nested column extractors are treated as different during nested column pruning.
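
To illustrate the problem described above, here is a minimal, self-contained sketch. The case classes `Attr` and `GetField` are hypothetical stand-ins written for this example, not Spark's Catalyst classes; the sketch only assumes that canonicalization discards cosmetic details such as the qualifier while keeping the attribute's identity.

```scala
// Toy model only: Attr and GetField are hypothetical stand-ins,
// not Spark's AttributeReference / GetStructField.
final case class Attr(name: String, exprId: Long, qualifier: Seq[String]) {
  // Assumed behaviour: the canonical form drops cosmetic details
  // (name and qualifier here) and keeps only the identity (exprId).
  def canonicalized: Attr = Attr("none", exprId, Seq.empty)
}

final case class GetField(child: Attr, field: String)

object CosmeticVariationDemo {
  // The same underlying attribute (exprId = 1) referenced with two qualifiers.
  val extractors: Seq[GetField] = Seq(
    GetField(Attr("name", 1L, Seq("contacts")), "first"),
    GetField(Attr("name", 1L, Seq("contact_alias")), "last")
  )

  // Grouping by the attribute as written splits the extractors apart.
  val byRaw: Map[Attr, Seq[GetField]] = extractors.groupBy(_.child)

  // Grouping by the canonicalized attribute keeps them together.
  val byCanonical: Map[Attr, Seq[GetField]] =
    extractors.groupBy(_.child.canonicalized)

  def main(args: Array[String]): Unit = {
    println(s"groups by raw attribute: ${byRaw.size}")
    println(s"groups by canonicalized attribute: ${byCanonical.size}")
  }
}
```

In this toy model the raw grouping yields two groups and the canonicalized grouping yields one, which is the kind of difference the patch addresses.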

Does this PR introduce any user-facing change?

Yes, fixing a bug in nested column pruning.

How was this patch tested?

Unit test.

viirya · 2020-07-03T02:20:49Z

cc @maropu @dongjoon-hyun @frankyin-factual

ukby1234 · 2020-07-03T02:46:48Z

Great! Thanks. I'm going to apply the same patch to my branch.

viirya · 2020-07-03T02:50:36Z

You can rebase after this gets merged.

viirya · 2020-07-03T05:46:43Z

Is Jenkins not working again?

viirya · 2020-07-03T05:46:48Z

ok to test

maropu · 2020-07-05T11:45:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

    val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
      .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(_.references.head.canonicalized.asInstanceOf[Attribute])


Nice catch, @viirya @frankyin-factual !

maropu · 2020-07-05T11:48:42Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+  testSchemaPruning("SPARK-32163: nested pruning should work even with cosmetic variations") {
+    withTempView("contact_alias") {
+      sql("select * from contacts")
+        .repartition(100, col("name.first"), col("name.last"))


Can this issue not happen in branch-3.0?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

Lines 80 to 85 in fc2660c

    private def canProjectPushThrough(plan: LogicalPlan) = plan match {
      case _: GlobalLimit => true
      case _: LocalLimit => true
      case _: Repartition => true
      case _: Sample => true
      case _ => false

This test case is only for the master branch. However, this issue can happen in branch-3.0 too. I added another test here, which is for branch-3.0.

But when we backport this to branch-3.0, we need to remove the first test case, because it will fail on checkScan(query, "struct<name:struct<first:string,last:string>>"): branch-3.0 doesn't prune for repartition by expression.

Yea, thanks for adding the test.

maropu · 2020-07-05T11:49:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

        if (nestedFieldToAlias.nonEmpty &&
-            nestedFieldToAlias
-              .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
+          dedupNestedFields.map(_.canonicalized.asInstanceOf[ExtractValue])


nit: we don't need the cast here?

Is this part related to the query failure in the test? It looks like it is just an optimization.

By the way, can't we avoid generating the duplicated aliases (name#1.first AS _gen_alias_52#52 and name#1.first AS _gen_alias_53#53) shown below? Is that technically difficult? (This is not related to this PR; it is just a question.)

    scala> sql("select name.first from contact_alias").explain()
    == Physical Plan ==
    *(2) Project [_gen_alias_52#52 AS first#50]
    +- Exchange hashpartitioning(_gen_alias_53#53, _gen_alias_54#54, 100), false, [id=#46]
       +- *(1) Project [name#1.first AS _gen_alias_52#52, name#1.first AS _gen_alias_53#53, name#1.last AS _gen_alias_54#54]
          +- FileScan parquet [name#1,p#8] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/contacts], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<first:string,last:string>>

Cosmetic variations, for example extractors with different qualifiers, cause the total number of fields to be counted incorrectly.

As for the duplicated aliases, we can avoid them by working on canonicalized extractors when generating aliases.

But we would then also need to convert incoming extractors to their canonicalized versions whenever we look them up in the alias map.

The current code is clear, and this does not seem to be a big deal: I think it is rare for a query to contain multiple extractors that differ only cosmetically. So I am not attempting that for now.
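
The trade-off described above can be sketched with a toy model. `Extractor` and the alias names below are hypothetical and only illustrate the idea; this is not the code in NestedColumnAliasing. Keying the alias map by the canonicalized extractor removes the duplicate alias, but every lookup must canonicalize its key first.

```scala
// Toy model only: Extractor is a hypothetical stand-in for a nested field
// extractor; the qualifier is the only cosmetic detail modelled here.
final case class Extractor(attrExprId: Long, field: String, qualifier: Seq[String]) {
  def canonicalized: Extractor = copy(qualifier = Seq.empty)
}

object AliasDedupDemo {
  val extractors: Seq[Extractor] = Seq(
    Extractor(1L, "first", Seq("contacts")),
    Extractor(1L, "first", Seq("contact_alias")),
    Extractor(1L, "last", Seq("contact_alias"))
  )

  // One alias per extractor as written: "first" is aliased twice.
  val perExtractor: Map[Extractor, String] =
    extractors.zipWithIndex.map { case (e, i) => e -> s"_gen_alias_$i" }.toMap

  // One alias per canonical form: "first" is aliased once.
  val perCanonical: Map[Extractor, String] =
    extractors.map(_.canonicalized).distinct.zipWithIndex
      .map { case (e, i) => e -> s"_gen_alias_$i" }.toMap

  // The extra step: every lookup must canonicalize its key first.
  def lookup(e: Extractor): Option[String] = perCanonical.get(e.canonicalized)

  def main(args: Array[String]): Unit = {
    println(s"aliases keyed by raw extractor: ${perExtractor.size}")
    println(s"aliases keyed by canonical extractor: ${perCanonical.size}")
    extractors.foreach(e => println(s"$e -> ${lookup(e)}"))
  }
}
```

Here both spellings of the `first` extractor resolve to the same alias, at the cost of the extra `canonicalized` call in `lookup`.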

Thanks for the explanation. Looks okay.

viirya · 2020-07-05T18:01:05Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+      val query2 = sql("select friends.middle, col from contact_alias")
+      checkScan(query2, "struct<friends:array<struct<first:string,middle:string>>>")
+      checkAnswer(query2, Row(Array("Z."), "Susan") :: Nil)


@maropu This test is for branch-3.0.

Thank you for the test case.

maropu · 2020-07-05T23:30:26Z

Looks okay. Could anyone else check this? @dongjoon-hyun @dbtsai

SparkQA · 2020-07-06T00:59:35Z

Test build #124928 has finished for PR 28988 at commit 04f6bb6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-07-06T01:01:26Z

retest this please

maropu · 2020-07-06T01:19:38Z

FYI: @zhengruifeng @ScrapCodes (I am not sure which of you is the release manager, though.) I think this fix should be included in the v3.0.1 release.

SparkQA · 2020-07-06T06:30:34Z

Test build #124991 has finished for PR 28988 at commit 04f6bb6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-07-06T06:38:49Z

retest this please...

viirya · 2020-07-06T07:14:14Z

retest this please

maropu · 2020-07-06T14:46:41Z

retest this please

viirya · 2020-07-06T23:29:49Z

Jenkins looks unstable.

maropu · 2020-07-06T23:35:43Z

Yeah, Shane is working hard on this issue now, so we need to wait a little until it stabilizes.

viirya · 2020-07-07T02:55:41Z

retest this please

SparkQA · 2020-07-07T07:05:02Z

Test build #125166 has finished for PR 28988 at commit 04f6bb6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-07-07T07:51:38Z

retest this please

SparkQA · 2020-07-07T14:35:11Z

Test build #125194 has finished for PR 28988 at commit 04f6bb6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

dongjoon-hyun

+1, LGTM. Thank you so much, @viirya , @maropu , @frankyin-factual , @HyukjinKwon .
Merged to master.

The last commit is an indentation-only fix.

dongjoon-hyun · 2020-07-07T18:18:55Z

Could you adjust the test case and make a backporting PR against branch-3.0, @viirya?

cc @dbtsai

viirya · 2020-07-07T18:23:46Z

Thanks all. Sure, let me create a backporting PR for branch-3.0.

SparkQA · 2020-07-07T23:03:42Z

Test build #125233 has finished for PR 28988 at commit d352dbc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

… variations

### What changes were proposed in this pull request?

This patch proposes to deal with cosmetic variations when processing nested column extractors in `NestedColumnAliasing`. Currently if cosmetic variations are in the nested column extractors, the query is not optimized.

This backports #28988 to branch-3.0.

### Why are the changes needed?

If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well.

For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug in nested column pruning.

### How was this patch tested?

Unit test.

Closes #29027 from viirya/SPARK-32163-3.0.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

nested pruning should work even with cosmetic variations.

4c047a1

probot-autolabeler bot added the SQL label Jul 3, 2020

maropu reviewed Jul 5, 2020

Add test case.

5f111b4

viirya commented Jul 5, 2020

Remove unnessary cast.

04f6bb6

dongjoon-hyun reviewed Jul 7, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala (outdated)

Fix ident.

d352dbc

dongjoon-hyun approved these changes Jul 7, 2020

dongjoon-hyun closed this in 90b9099 Jul 7, 2020

viirya mentioned this pull request Jul 7, 2020

[SPARK-32163][SQL][3.0] Nested pruning should work even with cosmetic variations #29027

Closed

viirya deleted the SPARK-32163 branch December 27, 2023 18:28
