[SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join #28556

viirya · 2020-05-17T05:49:24Z

What changes were proposed in this pull request?

Currently, nested column pruning is pushed through only a few operators, such as LIMIT and SAMPLE. This patch extends the feature to other operators, including RepartitionByExpression and Join.

Why are the changes needed?

Currently, nested column pruning is applied to only a few operators, which limits its benefit. Extending its coverage makes the feature apply more generally across different queries.

Does this PR introduce any user-facing change?

Yes. More SQL operators are covered by nested column pruning.

How was this patch tested?

Added unit tests and end-to-end tests.

SparkQA · 2020-05-17T07:05:01Z

Test build #122750 has finished for PR 28556 at commit 2c95e81.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-17T07:06:11Z

retest this please

SparkQA · 2020-05-17T13:02:05Z

Test build #122752 has finished for PR 28556 at commit 2c95e81.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-17T16:26:59Z

cc @dongjoon-hyun @dbtsai @cloud-fan @HyukjinKwon

viirya · 2020-05-17T16:27:38Z

cc @maropu

maropu · 2020-05-18T01:53:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

    case Project(projectList, child)
        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-      getAliasSubMap(projectList)
+      val exprsToPrune = projectList ++ child.expressions


nit: exprsToPrune -> exprCandidatesToPrune?

maropu · 2020-05-18T01:54:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-    })
+    }).transformExpressions {
+      case f: ExtractValue if nestedFieldToAlias.contains(f) =>
+        nestedFieldToAlias(f).toAttribute


~~Is this change needed only for supporting joins?~~

No, for example RepartitionByExpression also needs this change.

Ah, I got it. It seems this change is related to https://github.com/apache/spark/pull/28556/files#diff-43334bab9616cc53e8797b9afa9fc7aaL207-L215

Actually, any operator that has expressions needs this, so that ExtractValue expressions are replaced with the nested column aliases.
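To make the rewrite discussed in this thread concrete, here is a minimal, Spark-free sketch. The types `Expr`, `Attr`, `GetField`, `EqualTo`, the helper `replaceWithAliases`, and the alias name `alias_1` are all hypothetical stand-ins, not Catalyst's real classes or naming. It only illustrates the idea of swapping each nested field access that has an alias for that alias's attribute, inside whatever expressions an operator holds.

```scala
// Toy model only: hypothetical stand-ins for Catalyst's Expression,
// AttributeReference and ExtractValue. This is not Spark's real API.
sealed trait Expr
case class Attr(name: String) extends Expr
case class GetField(child: Expr, field: String) extends Expr // e.g. name.middle
case class EqualTo(left: Expr, right: Expr) extends Expr

object AliasRewriteSketch {
  // Replace every nested field access found in the alias map with its alias
  // attribute, and recurse into all other expressions.
  def replaceWithAliases(e: Expr, nestedFieldToAlias: Map[Expr, Attr]): Expr = e match {
    case g: GetField if nestedFieldToAlias.contains(g) => nestedFieldToAlias(g)
    case GetField(c, f) => GetField(replaceWithAliases(c, nestedFieldToAlias), f)
    case EqualTo(l, r) =>
      EqualTo(replaceWithAliases(l, nestedFieldToAlias), replaceWithAliases(r, nestedFieldToAlias))
    case a: Attr => a
  }

  def main(args: Array[String]): Unit = {
    val middle = GetField(Attr("name"), "middle")
    val aliases = Map[Expr, Attr](middle -> Attr("alias_1"))
    // An operator expression (say, a join condition) that mentions name.middle.
    val condition = EqualTo(middle, Attr("deptKey"))
    println(replaceWithAliases(condition, aliases))
  }
}
```

In this model the operator's own expression ends up referring to the alias attribute rather than the original struct column, which is why operators that carry expressions need the substitution as well as the project list.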

maropu · 2020-05-18T02:02:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-    val newGenerate = g.copy(generator = newGenerator)
-
-    NestedColumnAliasing.replaceChildrenWithAliases(newGenerate, attrToAliases)
+    NestedColumnAliasing.replaceChildrenWithAliases(g, nestedFieldToAlias, attrToAliases)


nit: I think we need to rename replaceChildrenWithAliases. We don't need Children in the name anymore, do we?

Changed the method name.

SparkQA · 2020-05-21T07:05:02Z

Test build #122916 has finished for PR 28556 at commit b77a1ba.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-21T07:06:59Z

retest this please

viirya · 2020-05-21T08:36:54Z

@maropu I addressed your comments. Could you help take another look? Thanks.

SparkQA · 2020-05-21T11:55:41Z

Test build #122917 has finished for PR 28556 at commit b77a1ba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-05-21T13:00:14Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala

  test("Pushing a single nested field projection - negative") {
    val ops = Seq(
      (input: LogicalPlan) => input.distribute('name)(1),
-      (input: LogicalPlan) => input.distribute($"name.middle")(1),


Ah, looks nice. This PR could support this case.

maropu · 2020-05-21T13:00:51Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala

+      .analyze
+    comparePlans(optimized1, expected1)
+
+


nit: unnecessary line break.

maropu · 2020-05-21T13:05:14Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+    checkAnswer(query3, Row("abc") :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select one deep nested complex field after outer join") {


Thanks for adding the tests.

maropu · 2020-05-21T13:06:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+    "struct<contactId:int>")
+    checkAnswer(query1, Row("X.") :: Row("Y.") :: Nil)
+
+    val query2 = sql("select contacts.name.middle from contacts, departments where " +


nit: I think it's better to use uppercase for SQL keywords where possible.

It seems all tests in this test suite use lowercase. Changing all of them seems like too much trouble... :)

maropu · 2020-05-21T13:08:30Z

I left some minor comments, but it looks okay. cc: @dongjoon-hyun @dbtsai

SparkQA · 2020-05-21T13:18:40Z

Test build #122922 has finished for PR 28556 at commit db601df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-22T07:05:02Z

Test build #122966 has finished for PR 28556 at commit f720bdf.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-22T07:23:26Z

retest this please

SparkQA · 2020-05-22T12:38:44Z

Test build #122977 has finished for PR 28556 at commit f720bdf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-27T07:16:08Z

retest this please

SparkQA · 2020-05-27T12:08:56Z

Test build #123168 has finished for PR 28556 at commit f720bdf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-06-02T07:08:06Z

ping @cloud-fan @dongjoon-hyun

SparkQA · 2020-06-11T03:07:50Z

Test build #123785 has finished for PR 28556 at commit ce5d8dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-06-12T07:07:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

+      val exprCandidatesToPrune = projectList ++ child.expressions
+      getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq)

    case plan if SQLConf.get.nestedSchemaPruningEnabled && canPruneOn(plan) =>


No big deal, but I would rename plan to p to avoid shadowing the plan argument. At least my IDE complains about that.

I will change it in another PR. Thanks.

HyukjinKwon · 2020-06-12T07:39:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-      getAliasSubMap(projectList)
+      val exprCandidatesToPrune = projectList ++ child.expressions
+      getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq)


@viirya, just to clarify, you added producedAttributes here just to be safe but not related to the current changes (?). Seems Join and RepartitionByExpression have an empty producedAttributes.

Okay, if it's going to be output, it shouldn't be pruned anyway.
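The exchange above can be pictured with a small Spark-free sketch. Everything in it (`Attr`, `GetField`, `root`, `collectPrunable`) is a hypothetical toy model rather than Catalyst's API. It gathers nested field accesses from a list of candidate expressions and skips those rooted in an excluded attribute, which is the role the `producedAttributes` argument appears to play in the diff.

```scala
// Toy model only: hypothetical names, not Spark's real classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class GetField(child: Expr, field: String) extends Expr

object PrunableFieldsSketch {
  // Root attribute of a (possibly nested) field access.
  def root(e: Expr): Attr = e match {
    case a: Attr => a
    case GetField(c, _) => root(c)
  }

  // Collect nested field accesses whose root attribute is not excluded.
  // Plain attribute references are ignored because they are not nested accesses.
  def collectPrunable(candidates: Seq[Expr], excluded: Set[Attr]): Seq[GetField] =
    candidates.collect {
      case g: GetField if !excluded.contains(root(g)) => g
    }.distinct

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      GetField(Attr("name"), "middle"), // from the project list
      GetField(Attr("name"), "middle"), // same access in the operator's own expressions
      GetField(Attr("generated"), "x")  // rooted in an attribute the operator itself produces
    )
    println(collectPrunable(candidates, excluded = Set(Attr("generated"))))
  }
}
```

With an empty exclusion set, as noted for Join and RepartitionByExpression, the filter in this sketch has no effect and every nested access remains a candidate.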

HyukjinKwon · 2020-06-12T07:54:33Z

Merged to master.

Nested column aliasing for other operators.

2c95e81

probot-autolabeler bot added the SQL label May 17, 2020

viirya mentioned this pull request May 17, 2020

[SPARK-27217][SQL] Nested column aliasing for more operators which can prune nested column #28560

Closed

maropu reviewed May 18, 2020

viirya changed the title ~~[SPARK-31736][SQL] Nested column aliasing for other operators~~ [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join May 21, 2020

Address some comments.

b77a1ba

Add outer join tests.

db601df

maropu reviewed May 21, 2020

Remove unnecessary blank line.

f720bdf

maropu approved these changes May 23, 2020

viirya added 2 commits June 10, 2020 14:40

Merge remote-tracking branch 'upstream/master' into others-column-pruning

719a2ad

Address comment.

ce5d8dc

viirya commented Jun 10, 2020

HyukjinKwon reviewed Jun 12, 2020

HyukjinKwon approved these changes Jun 12, 2020

HyukjinKwon closed this in ff89b11 Jun 12, 2020

viirya deleted the others-column-pruning branch December 27, 2023 18:23
