-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35756][SQL] unionByName supports struct having same col names but different sequence #32972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I think it works in current master, no? |
bfb7582 to
5f95595
Compare
yes this is |
|
How does it relate to #32448? |
|
They're mostly different issues. This is more of a semantics thing. If you have two nested structs with the same fields, but in a different order, you have to set So I do think this idea makes sense, but I don't think the implementation handles multiple levels of nested structs correctly. I think the logic would have to be added to |
@Kimahriman - Not able to understand how missing columns added null nested column in this PR for allowMissingCol is false. In this PR we are only calling the addFields when both source and target side have same columns on both sides ( target.toAttributes.map(attr => attr.name).sorted == source.toAttributes.map(x => x.name).sorted). so this missingFieldsOpt is always empty , and we just do a sorting when allowMissingCol is false. Please do let me know if my understanding is not correct here. |
|
Yeah I'm saying it only properly handles one level of a nested struct, not recursively like it should. Got your code running to show an example: The inner struct gets merged adding missing columns even though allowMissingCol is false |
Thank you for explaining the nested struct scenario. For allowMissingCol as false , we need to do only the sort so instead of using addFields method used sortStructFields added the unit test for the same |
5e7592b to
323a3f9
Compare
|
This will definitely conflict with #32448, but I can update as necessary if this is accepted and goes in first. Still waiting on some feedback on that. |
|
cc @viirya , @HyukjinKwon @cloud-fan - Please review this Pr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we handle different column order for top-level columns right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this handled in the code where each of the left attribute it adds its corresponding right side
val rightProjectList = leftOutputAttrs.map { lattr =>
val found = rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }
if (found.isDefined) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit fragile as it doesn't consider case sensitivity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnionByName with allowMissing columns true add it as the missing column in case of case senistive attributes for both the scenarios spark.sql.caseSensitive as true and false
case class UnionClass2(a: Int, c: String)
case class UnionClass4(A: Int, b: Long)
case class UnionClass1a(a: Int, b: Long, nested: UnionClass2)
case class UnionClass1c(a: Int, b: Long, nested: UnionClass4)
val df1 = Seq((0, UnionClass1a(0, 1L, UnionClass2(1, "2")))).toDF("id", "a")
val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L)))).toDF("id", "a")
case 1 - set spark.sql.caseSensitive=false
scala> spark.sql("set spark.sql.caseSensitive=false")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res7: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`: STRUCT<`a`: INT, `b`: BIGINT, `c`: STRING, `A`: INT>>
case 2 -> when spark.sql.caseSensitive is enabled
scala> spark.sql("set spark.sql.caseSensitive=true")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala>
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res3: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>
for UnionbyName without allowMissing -> we cannot add the missing column ,it should give the exception with schema not same so union cannot be done so that is the reason for not converting into lower case and than do the comparison
> var unionDf = df1.unionByName(df2)
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<a:int,b:bigint,nested:struct<A:int,b:bigint>> <> struct<a:int,b:bigint,nested:struct<a:int,c:string>> at the second column of the second table;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he means more like
df1 = spark.createDataFrame([Row(nested=Row(a=1, b=2))])
df2 = spark.createDataFrame([Row(nested=Row(B=1, A=2))])
df1.unionByName(df2)
These wouldn't get merged without doing case insensitive comparisons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation is not required
target.toAttributes.map(attr => attr.name).sorted == source.toAttributes.map(x => x.name).sorted
Since the we need to sort both left and right side in case of struct and sortStructFields method recursively sorts in all the nested struct.
Now after removing this validation case sensitive attributes with different sequence are also working
as its working for the case sensitive attributes with same sequence
|
There is one scenario which is failing for the allowMissing as true in the existing master branch, This is the scenario where there is sorting done on the case C2 and c1 here after sorting it gives C2, c1 where as on the other side if c1, c2 so it fails with Union can only be performed on tables with the compatible column types. There is need to do a sorting on the lower case to handle this case. After fix |
|
#33040 was merged so this needs to be reworked based off that now |
…ol as false for nested structs
…ort recursively both left and right side of the union
… case attributes name
Thanks for sharing the details. Will update the PR as per the new change done |
f43b8d5 to
a82dd58
Compare
| var unionDF = df1.unionByName(df2) | ||
| var expected = Row(1, Row(1, 2)) :: Row(1, Row(2, 1)) :: Nil | ||
| val schema = "`a` INT,`b` STRUCT<`c1`: INT, `c2`: INT>" | ||
| assert(unionDF.schema.toDDL === schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit fragile to compare the DDL string, can we compare StructType instance directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the StructType comparison
|
@cloud-fan - Shall we trigger the test build for this PR, seems like its not triggered |
|
ok to test |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #140443 has finished for PR 32972 at commit
|
|
@cloud-fan - All the test passed on this PR. If every thing looks good than, shall we merge this PR |
|
thanks, merging to master! |
What changes were proposed in this pull request?
unionByName does not supports struct having same col names but different sequence
it gives the exception
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table; 'Union false, false :- LocalRelation [_1#38, _2#39] +- LocalRelation _1#45, _2#46In this case the col names are same so this unionByName should have the support to check within in the Struct if col names are same it should not throw this exception and works.
after fix we are getting the result
Why are the changes needed?
As per unionByName functionality based on name, does the union. In the case of struct this scenario was missing where all the columns names are same but sequence is different, so added this functionality.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added the unit test and also done the testing through spark shell