[SPARK-35290][SQL] Append new nested struct fields rather than sort for unionByName with null filling #33040

Kimahriman · 2021-06-23T11:18:26Z

What changes were proposed in this pull request?

This PR changes the unionByName null-filling logic to append new nested struct fields from the right side of the union to the schema instead of sorting fields alphabetically. It removes the need to use UpdateField expressions and directly projects new nested structs from each side of the union with the correct schema. The unioned schema, previously sorted alphabetically, is now "left dominant": the fields from the left side of the union come first, and the fields missing from the left are then appended in the order they appear on the right.
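The "left dominant" ordering described above can be sketched with plain Scala collections. This is only an illustration, not the code in this PR: `DType`, `StructT`, and `LeftDominantMerge` are hypothetical stand-ins for Spark's types, and field names are compared exactly here, whereas the real rule goes through a resolver.

```scala
// Hypothetical stand-ins for Spark's DataType / StructType (not Spark classes).
sealed trait DType
case object IntT extends DType
case object StrT extends DType
final case class StructT(fields: Vector[(String, DType)]) extends DType

object LeftDominantMerge {
  /** Keep `left`'s field order, then append fields found only in `right`, in `right`'s order. */
  def merge(left: StructT, right: StructT): StructT = {
    val rightByName: Map[String, DType] = right.fields.toMap
    val leftNames: Set[String] = left.fields.map(_._1).toSet

    // Recurse into fields that are structs on both sides; keep everything else from the left.
    val merged: Vector[(String, DType)] = left.fields.map {
      case (name, l: StructT) =>
        rightByName.get(name) match {
          case Some(r: StructT) => name -> merge(l, r)
          case _                => name -> l
        }
      case other => other
    }

    // Fields that exist only on the right are appended after the left fields.
    val rightOnly = right.fields.filterNot { case (name, _) => leftNames.contains(name) }
    StructT(merged ++ rightOnly)
  }

  def main(args: Array[String]): Unit = {
    val left  = StructT(Vector("c" -> IntT, "a" -> StructT(Vector("y" -> IntT))))
    val right = StructT(Vector("a" -> StructT(Vector("x" -> StrT, "y" -> IntT)), "b" -> StrT))
    // Top-level order is c, a, b; inside `a` the order is y, x.
    println(merge(left, right))
  }
}
```

With alphabetical sorting the same inputs would have produced `a` (with `x`, `y`), `b`, `c`; with the left-dominant rule the left side's order is preserved at every nesting level.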

Why are the changes needed?

Certain nested structs caused unionByName with null filling to fail, because of the part of the logic that rewrote the expression tree to sort the structs.

Does this PR introduce any user-facing change?

Yes. After unionByName with null filling, nested struct fields will be in a different order than before, though this should make little practical difference.

How was this patch tested?

Updated existing tests based on the new StructField ordering and added a new test for the case that was broken originally.

Kimahriman · 2021-06-23T11:19:49Z

Rework of #32448 with just the unionByName fixes and without StructType.merge changes that weren't necessary anymore for this

HyukjinKwon · 2021-06-23T13:15:29Z

Thanks, @Kimahriman.

HyukjinKwon · 2021-06-23T13:15:34Z

ok to test

HyukjinKwon · 2021-06-23T13:15:41Z

cc @viirya FYI

SparkQA · 2021-06-23T14:18:37Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44733/

cloud-fan · 2021-06-23T15:34:27Z

docs/sql-migration-guide.md


  - In Spark 3.2, the query executions triggered by `DataFrameWriter` are always named `command` when being sent to `QueryExecutionListener`. In Spark 3.1 and earlier, the name is one of `save`, `insertInto`, `saveAsTable`.
+
+  - In Spark 3.2, `Dataset.unionByName` with `allowMissingColumns` set to true will add missing nested fields to the end of structs. In Spark 3.1, nested struct fields are sorted alphabetically.


The union of top-level columns is also "left dominant", so this makes sense.

cloud-fan · 2021-06-23T16:46:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

+        case (Some(cf), expectedType: StructType) if cf.dataType.isInstanceOf[StructType] =>
+            val extractedValue = ExtractValue(col, Literal(cf.name), resolver)
+            val combinedStruct = addFields(extractedValue, expectedType)
+            if (extractedValue.nullable) {


It's hard to see why null handling is needed here. I think we should move the null handling to where we return CreateNamedStruct

which is https://github.com/apache/spark/pull/33040/files#diff-84dd17265dcadd59f6ad9e649203d38b808485c7b5bd3937136222378f2ed27dR79

Yeah I just copied this from the old version. @viirya is there a reason all these nullable checks needed to be added? I don't know what you mean by move the null handling to where we return CreateNamedStruct though

Err maybe I didn't copy that directly from somewhere, just saw it used other places. Not sure if/where/when we need the nullable checks

Do you mean:

val newExpr = CreateNamedStruct(existingExprs)
if (expr.nullable) {
  If(IsNull(expr), Literal(null, newExpr.dataType), newExpr)
} else {
  newExpr
}

Otherwise I can't find a nullable check.

Yea, the code posted by @viirya is more widely used in the Spark codebase. The rationale is pretty simple: CreateNamedStruct never returns null, so we need to handle the case where the input is null.

Yeah, why was that needed? I copied that I guess because I thought it might be needed here as well for some reason. Is that basically trying to keep nulls down to the lowest level of the struct, instead of taking a null struct and making a non-null struct with all null values?

Moved the null check to the CreateNamedStruct
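The reason for the guard discussed in this thread can be shown with a small model in plain Scala. This is a sketch under assumptions, not Catalyst code: a struct value is modelled as a vector of named optional ints, an outer `None` stands for a null struct, and `NullGuardSketch` and its methods are hypothetical names.

```scala
object NullGuardSketch {
  // A struct value: named, nullable int fields. An outer None models a null struct.
  type Row = Vector[(String, Option[Int])]

  /** Rebuilds every target field unconditionally, so the result is never null. */
  def rebuildUnguarded(in: Option[Row], target: Vector[String]): Option[Row] = {
    val fields = in.map(_.toMap).getOrElse(Map.empty[String, Option[Int]])
    Some(target.map(name => name -> fields.getOrElse(name, None)))
  }

  /** Same rebuild, but a null input stays null (the role of the If(IsNull(...)) guard). */
  def rebuildGuarded(in: Option[Row], target: Vector[String]): Option[Row] =
    in.flatMap(_ => rebuildUnguarded(in, target))

  def main(args: Array[String]): Unit = {
    val target = Vector("a", "b")
    println(rebuildUnguarded(None, target))                       // a struct whose fields are all null
    println(rebuildGuarded(None, target))                         // null struct stays null
    println(rebuildGuarded(Some(Vector("a" -> Some(1))), target)) // missing field "b" is null-filled
  }
}
```

Without the guard, a null struct on one side of the union would turn into a non-null struct whose fields are all null, which changes the data rather than just the schema.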

viirya · 2021-06-23T17:15:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

+    colType.fields
+      .filter(f => targetType.fields.find(tf => resolver(f.name, tf.name)).isEmpty)
+      .foreach { f =>
+        newStructFields ++= Literal(f.name) :: ExtractValue(col, Literal(f.name), resolver) :: Nil


Is this to add fields only in left side at the end of struct? Doesn't it match original field order?

When the left is projected this should match the original, but when the right is projected this will contain things on the right that aren't in the left. Basically it's

rightChild = left ++ (right - left)
leftChild = rightChild ++ (left - rightChild) = rightChild

Where is the project? Do you mean

val rightChild = Project(rightProjectList ++ notFoundAttrs, right)

?

It is top level column projection. I mean the nested column field order.

newStructFields contains the (nested) struct fields both in left and right column in right order.

Then here it adds the (nested) struct fields that exist only in left back to newStructFields, before creating the new struct (CreateNamedStruct).

Do we reorder the fields later?

rightProjectList contains the nested structs mapped in the order of left fields then remaining right fields recursively, so that's where all the reordering happens

And then leftChild is created from the fields in rightChild, which already has all the fields at that point: the left fields and then the right fields

rightProjectList contains the nested structs mapped in the order of left fields then remaining right fields recursively, so that's where all the reordering happens

The projection projects original right attributes to rightProjectList. If you have different nested column order, it will be projected to new order.

The projection is not for reordering the nested column.

I looked at the code in more detail.

targetType is actually the left-side type in the first call. So here we align a right struct column to the left struct column.

So it makes sense to add left nested columns first (newStructFields), then add nested columns only in right struct.

Yeah I kept most of the naming which gets a little weird with how left/right/source/target are constructed and depends on where in the codepath you are
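The two-pass alignment worked out in this thread can be sketched on flat lists of field names. This is an illustration only: `TwoPassAlignment.align` is a hypothetical helper, and nesting and the resolver are ignored.

```scala
object TwoPassAlignment {
  /** Target-first ordering: `target`'s names, then the names found only in `source`. */
  def align(source: Vector[String], target: Vector[String]): Vector[String] =
    target ++ source.filterNot(name => target.contains(name))

  def main(args: Array[String]): Unit = {
    val left  = Vector("c", "a")
    val right = Vector("a", "b")

    // First pass: the right side is aligned to the left type, i.e. left ++ (right - left).
    val rightChild = align(source = right, target = left)
    // Second pass: the left side is aligned to rightChild, which already has every field.
    val leftChild = align(source = left, target = rightChild)

    assert(rightChild == Vector("c", "a", "b"))
    assert(leftChild == rightChild)
    println(rightChild.mkString(", "))
  }
}
```

Because the second pass has nothing left to append, both children end up with the same field order, which is why the meaning of "target" depends on which pass is running.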

viirya · 2021-06-23T17:15:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

      }
-    }
+
+    CreateNamedStruct(newStructFields.toSeq)


I think @cloud-fan means to add the null check here.

Yeah I think I understand now

SparkQA · 2021-06-23T17:51:40Z

Test build #140205 has finished for PR 33040 at commit b0042da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

After addressing the null check and method comment, I think this should be fine.

SparkQA · 2021-06-24T12:33:45Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44793/

SparkQA · 2021-06-24T15:26:07Z

Test build #140261 has finished for PR 33040 at commit eb8ddfb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-06-24T16:21:00Z

Thanks @Kimahriman! Merging to master.

cfmcgrady · 2021-06-24T16:33:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

-            // like that. We will sort columns in the struct expression to make sure two sides of
-            // union have consistent schema.
+            // We have two structs with different types, so make sure the two structs have their
+            // fields in the same order by using `target`'s fields and then inluding any remaining


nit: inluding -> including

Closes apache#33040 from Kimahriman/union-by-name-struct-order.

Authored-by: Adam Binford
Signed-off-by: Liang-Chi Hsieh

Maintain existing struct field order when unioning by name with null filling

b0042da

github-actions bot added DOCS SQL labels Jun 23, 2021

Kimahriman mentioned this pull request Jun 23, 2021

[SPARK-35290][SQL] Append new nested struct fields rather than sort for unionByName with null filling #32448

Closed

Move null check to new named struct, update method comment, and remove unused variable

eb8ddfb

cloud-fan approved these changes Jun 24, 2021

viirya approved these changes Jun 24, 2021

viirya closed this in 14b1836 Jun 24, 2021

Kimahriman mentioned this pull request Jun 29, 2021

[SPARK-35756][SQL] unionByName supports struct having same col names but different sequence #32972

Closed

