Conversation

@SaurabhChawla100 (Contributor) commented Jun 19, 2021

What changes were proposed in this pull request?

unionByName does not support structs that have the same column names but in a different order.

case class Struct1(c1: Int, c2: Int)
case class Struct2(c2: Int, c1: Int)

val df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
val df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
val unionDF = df1.unionByName(df2)

This throws the following exception:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types.
struct<c2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table;
'Union false, false
:- LocalRelation [_1#38, _2#39]
+- LocalRelation [_1#45, _2#46]

In this case the column names are the same, so unionByName should check inside the struct: if the field names match, it should not throw this exception and the union should work.

After the fix, the union succeeds:

val unionDF = df1.unionByName(df2)
scala>  unionDF.show
+---+------+                                                                    
|  a|     b|
+---+------+
|  1|{1, 2}|
|  1|{2, 1}|
+---+------+
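The core of the fix, reordering struct fields by name on both sides of the union, can be sketched outside Spark with a small Python illustration. This is not Spark code: `sort_struct_fields` is a hypothetical helper, and lists of `(name, type)` pairs stand in for `StructType`.

```python
# Sketch only: (name, type) pairs stand in for Spark's StructType fields.
# sort_struct_fields is a hypothetical helper, not a Spark API.

def sort_struct_fields(fields):
    """Sort struct fields by name, recursing into nested structs
    (represented here as lists of (name, type) pairs)."""
    return sorted(
        (name, sort_struct_fields(t) if isinstance(t, list) else t)
        for name, t in fields
    )

# Schemas of df1's and df2's "b" column: same names, different order.
left = [("c1", "int"), ("c2", "int")]
right = [("c2", "int"), ("c1", "int")]

assert left != right                                          # orders differ
assert sort_struct_fields(left) == sort_struct_fields(right)  # reconciled
```

Once both sides sort to the same field order, the struct columns compare as the same type and the by-name union can line them up.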

Why are the changes needed?

unionByName performs the union based on column names. For structs, the scenario where all field names are the same but the order differs was missing, so this adds that support.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests and also tested manually through the Spark shell.

@github-actions github-actions bot added the SQL label Jun 19, 2021
@viirya (Member) commented Jun 19, 2021

I think it works in current master, no?

scala> val df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: struct<c1: int, c2: int>]

scala> val df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: struct<c2: int, c1: int>]

scala> val unionDF = df1.unionByName(df2, true)
unionDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: struct<c1: int, c2: int>]

scala> unionDF.show
+---+------+
|  a|     b|
+---+------+
|  1|{1, 2}|
|  1|{2, 1}|
+---+------+

@SaurabhChawla100 (Contributor Author):

> I think it works in current master, no? [...]

Yes, val unionDF = df1.unionByName(df2, true) works on the current master, but val unionDF = df1.unionByName(df2) throws the exception. This change makes that scenario work as well.

@HyukjinKwon (Member):

How does it relate to #32448? StructType.merge can handle this case IIRC.

@Kimahriman (Contributor):

They're mostly different issues. This is more of a semantics thing. If you have two nested structs with the same fields, but in a different order, you have to set allowMissingCol to true in order for the structs to be sorted, which isn't very intuitive. This is trying to make the ByName part apply to nested structs as well, and leave allowMissingCol to just actually apply to missing (possibly nested) columns.

So I do think this idea makes sense, but I don't think the implementation handles multiple levels of nested structs correctly. addFields assumes adding missing columns, so I think you could end up with a case that adds null nested columns even if allowMissingCol is false.

I think the logic would have to be added to addFields to handle whether or not it should add null missing columns.
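The semantic split described above can be sketched in Python (illustration only; `reconcile` is a hypothetical helper and `(name, type)` pairs stand in for struct fields): reordering same-named fields is always safe, but padding genuinely missing fields should only happen when allowMissingCol is true.

```python
def reconcile(fields, other, allow_missing):
    """Sketch: reorder fields by name always; add fields that exist only on
    the other side (null-padded at runtime) only when allow_missing is True."""
    names = {name for name, _ in fields}
    merged = list(fields)
    for name, typ in other:
        if name not in names:
            if not allow_missing:
                raise ValueError(f"missing field {name!r}")
            merged.append((name, typ))
    return sorted(merged)

left = [("c2", "int"), ("c1", "int")]

# Same field names, different order: reordering alone is enough.
assert reconcile(left, [("c1", "int"), ("c2", "int")], False) == \
    [("c1", "int"), ("c2", "int")]

# A genuinely missing field is only filled in with allow_missing=True.
assert reconcile(left, [("c3", "int")], True) == \
    [("c1", "int"), ("c2", "int"), ("c3", "int")]
```

The bug being discussed is the case where the merge step runs on inner structs even when `allow_missing` is false.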

@SaurabhChawla100 (Contributor Author) commented Jun 20, 2021

> So I do think this idea makes sense, but I don't think the implementation handles multiple levels of nested structs correctly. addFields assumes adding missing columns, so I think you could end up with a case that adds null nested columns even if allowMissingCol is false.

@Kimahriman - I'm not able to see how this PR can add null nested columns when allowMissingCol is false.

case (source: StructType, target: StructType)
    if !allowMissingCol && !source.sameType(target) &&
      target.toAttributes.map(attr => attr.name).sorted ==
        source.toAttributes.map(x => x.name).sorted =>
  // Having an output with same name, but different struct type.
  // We will sort columns in the struct expression to make sure two sides of
  // union have consistent schema.
  aliased += foundAttr
  Alias(addFields(foundAttr, target), foundAttr.name)()

In this PR we only call addFields when the source and target sides have the same set of column names (target.toAttributes.map(attr => attr.name).sorted == source.toAttributes.map(x => x.name).sorted).

val missingFieldsOpt =
  StructType.findMissingFields(col.dataType.asInstanceOf[StructType], target, resolver)

so missingFieldsOpt is always empty, and when allowMissingCol is false we only do the sorting:

if (missingFieldsOpt.isEmpty) {
  sortStructFields(col)
}

Please let me know if my understanding is incorrect here.

@Kimahriman (Contributor):

Yeah, I'm saying it only handles one level of a nested struct properly, not recursively like it should. I got your code running to show an example:

>>> df1 = spark.createDataFrame([Row(a=Row(aa=Row(aaa=1)))])
>>> df2 = spark.createDataFrame([Row(a=Row(aa=Row(aab=1)))])
>>> df1
DataFrame[a: struct<aa:struct<aaa:bigint>>]
>>> df2
DataFrame[a: struct<aa:struct<aab:bigint>>]
>>> df1.unionByName(df2)
DataFrame[a: struct<aa:struct<aaa:bigint,aab:bigint>>]
>>> df1.unionByName(df2).explain()
== Physical Plan ==
Union
:- *(1) Project [if (isnull(a#21)) null else named_struct(aa, if (isnull(a#21.aa)) null else named_struct(aaa, a#21.aa.aaa, aab, null)) AS a#37]
:  +- *(1) Scan ExistingRDD[a#21]
+- *(2) Project [if (isnull(a#23)) null else named_struct(aa, if (isnull(a#23.aa)) null else named_struct(aaa, null, aab, a#23.aa.aab)) AS a#34]
   +- *(2) Scan ExistingRDD[a#23]

The inner struct gets merged, adding missing columns, even though allowMissingCol is false.

@SaurabhChawla100 (Contributor Author):

> Yeah I'm saying it only properly handles one level of a nested struct, not recursively like it should. [...] The inner struct gets merged adding missing columns even though allowMissingCol is false

Thank you for explaining the nested struct scenario. When allowMissingCol is false we only need to sort, so I replaced the addFields call with sortStructFields.

Added a unit test for this case:

case class UnionClass1d(c1: Int, c2: Int, c3: Struct3)
case class UnionClass1e(c2: Int, c1: Int, c3: Struct4)
case class Struct3(c3: Int)
case class Struct4(c4: Int)

val df1 = Seq((1, 2, UnionClass1d(1, 2, Struct3(1)))).toDF("a", "b", "c")
val df2 = Seq((1, 2, UnionClass1e(1, 2, Struct4(1)))).toDF("a", "b", "c")

df1.unionByName(df2) -> this does not add the missing column; instead it throws an exception:

"Union can only be performed on tables with the compatible column types." +
  " struct<c1:int,c2:int,c3:struct<c4:int>> <> struct<c1:int,c2:int,c3:struct<c3:int>>" +
  " at the third column of the second table"

@Kimahriman (Contributor):

This will definitely conflict with #32448, but I can update as necessary if this is accepted and goes in first. Still waiting on some feedback on that.

@SaurabhChawla100 (Contributor Author):

cc @viirya @HyukjinKwon @cloud-fan - please review this PR.

@cloud-fan (Contributor) commented Jun 22, 2021

how do we handle different column order for top-level columns right now?

@SaurabhChawla100 (Contributor Author) commented Jun 22, 2021

This is handled in the code where, for each left-side attribute, we find its corresponding attribute on the right side:

https://github.com/apache/spark/pull/32972/files#diff-84dd17265dcadd59f6ad9e649203d38b808485c7b5bd3937136222378f2ed27dR170

val rightProjectList = leftOutputAttrs.map { lattr =>
  val found = rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }
  if (found.isDefined) {
    // ...

Contributor:

This looks a bit fragile as it doesn't consider case sensitivity.

Contributor Author:

unionByName with allowMissingColumns = true adds a case-sensitive attribute as a missing column in both scenarios, spark.sql.caseSensitive = true and false:

case class UnionClass2(a: Int, c: String)
case class UnionClass4(A: Int, b: Long)
case class UnionClass1a(a: Int, b: Long, nested: UnionClass2)
case class UnionClass1c(a: Int, b: Long, nested: UnionClass4)


 val df1 = Seq((0, UnionClass1a(0, 1L, UnionClass2(1, "2")))).toDF("id", "a")
 val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L)))).toDF("id", "a")

case 1 - set spark.sql.caseSensitive=false

scala> spark.sql("set spark.sql.caseSensitive=false")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a: struct<a: int, b: bigint ... 1 more field>]

scala> unionDf.schema.toDDL
res7: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`: STRUCT<`a`: INT, `b`: BIGINT, `c`: STRING, `A`: INT>>

case 2 - set spark.sql.caseSensitive=true

scala> spark.sql("set spark.sql.caseSensitive=true")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a: struct<a: int, b: bigint ... 1 more field>]

scala> unionDf.schema.toDDL
res3: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>

For unionByName without allowMissing, we cannot add the missing column; it should throw an exception because the schemas are not the same and the union cannot be done. That is why we do not lower-case the names before comparing:

scala> var unionDf = df1.unionByName(df2)
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<a:int,b:bigint,nested:struct<A:int,b:bigint>> <> struct<a:int,b:bigint,nested:struct<a:int,c:string>> at the second column of the second table;

Contributor:

I think he means more like

df1 = spark.createDataFrame([Row(nested=Row(a=1, b=2))])
df2 = spark.createDataFrame([Row(nested=Row(B=1, A=2))])
df1.unionByName(df2)

These wouldn't get merged without doing case insensitive comparisons

Contributor Author:

This validation is not required:
target.toAttributes.map(attr => attr.name).sorted == source.toAttributes.map(x => x.name).sorted
since we need to sort both the left and the right side for structs anyway, and the sortStructFields method sorts recursively through all nested structs.

After removing this validation, case-sensitive attributes in a different order also work, just as they already do for case-sensitive attributes in the same order.

@SaurabhChawla100 (Contributor Author) commented Jun 23, 2021

There is one scenario that fails with allowMissing = true on the existing master branch: when the fields C2 and c1 are sorted, the result is (C2, c1), while the other side sorts to (c1, c2), so the union fails with "Union can only be performed on tables with the compatible column types".

scala> case class Struct2(C2: Int, c1: Int)
defined class Struct2

scala> case class Struct1(c1: Int, c2: Int)
defined class Struct1

scala> var df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: struct<C2: int, c1: int>]

scala> var df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: struct<c1: int, c2: int>]

scala> var unionDF = df1.unionByName(df2, true)
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<C2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table;
'Union false, false

We need to sort on the lower-cased field names to handle this case.
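The underlying problem is plain lexicographic sorting: uppercase ASCII letters sort before lowercase ones, so "C2" comes before "c1". A small Python check (illustration only, not Spark code) shows why a lower-cased sort key fixes it:

```python
left_names = ["C2", "c1"]    # fields of Struct2
right_names = ["c1", "c2"]   # fields of Struct1

# Plain lexicographic sort: 'C' (0x43) < 'c' (0x63), so the sides disagree.
assert sorted(left_names) == ["C2", "c1"]
assert sorted(right_names) == ["c1", "c2"]

# Sorting on the lower-cased name yields a consistent order on both sides.
assert sorted(left_names, key=str.lower) == ["c1", "C2"]
assert sorted(right_names, key=str.lower) == ["c1", "c2"]
assert [n.lower() for n in sorted(left_names, key=str.lower)] == \
    [n.lower() for n in sorted(right_names, key=str.lower)]
```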

After fix

scala> case class Struct2(C2: Int, c1: Int)
defined class Struct2

scala> case class Struct1(c1: Int, c2: Int)
defined class Struct1

scala> var df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: struct<c1: int, c2: int>]

scala> var df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: struct<C2: int, c1: int>]

scala> var unionDF = df1.unionByName(df2, true)
unionDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: struct<c1: int, c2: int ... 1 more field>]

@Kimahriman (Contributor):

#33040 was merged so this needs to be reworked based off that now

@SaurabhChawla100 (Contributor Author):

#33040 was merged so this needs to be reworked based off that now

Thanks for sharing the details. I will update the PR as per the new change.

var unionDF = df1.unionByName(df2)
var expected = Row(1, Row(1, 2)) :: Row(1, Row(2, 1)) :: Nil
val schema = "`a` INT,`b` STRUCT<`c1`: INT, `c2`: INT>"
assert(unionDF.schema.toDDL === schema)
Contributor:

It's a bit fragile to compare the DDL string, can we compare StructType instance directly?

Contributor Author:

Added the StructType comparison.

@SaurabhChawla100 (Contributor Author):

@cloud-fan - Shall we trigger the test build for this PR? It seems it has not been triggered.

@cloud-fan (Contributor):

ok to test

@SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44957/

@SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44957/

@SparkQA commented Jun 30, 2021

Test build #140443 has finished for PR 32972 at commit a29df92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaurabhChawla100 (Contributor Author):

@cloud-fan - All the tests passed on this PR. If everything looks good, shall we merge it?

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in ca12176 Jul 1, 2021