-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54595][SQL] Keep existing behavior of MERGE INTO without SCHEMA EVOLUTION clause #53326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
HI @dongjoon-hyun fyi, @aokolnychyi tested this MERGE INTO WITH SCHEMA EVOLUTION feature more and had this feedback. Will try to get consensus today, but preparing the pr just in case, its a simple one. The rationale for this pr is that its safer to keep behavior without SCHEMA EVOLUTION clause (where Spark used to throw exception), and we can relax it later, rather than the other way around. Note its the case without SCHEMA EVOLUTION clause, it doesnt change the behavior of the new feature (with the SCHEMA EVOLUTION clause) |
|
Thank you for informing that, @szehon-ho . |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the failure, @szehon-ho ?
- MERGE INTO TABLE - primary *** FAILED *** (19 milliseconds)
updateAssigns.apply(0).key.asInstanceOf[org.apache.spark.sql.catalyst.expressions.AttributeReference].sameRef(ts) was
|
Gentle ping, @szehon-ho . |
|
ah let me take a look |
|
Fixed it, thanks! |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @szehon-ho .
Merged to master/4.1.
…A EVOLUTION clause ### What changes were proposed in this pull request? Keep existing behavior for MERGE INTO without SCHEMA EVOLUTION clause for UPDATE SET * and INSERT * as well as UPDATE struct or INSERT struct, to throw exception if the source and target schemas are not exactly the same. ### Why are the changes needed? As aokolnychyi tested this feature, he mentioned that as of Spark 4.1 the behavior is changed for MERGE INTO but without SCHEMA EVOLUTION clause. In particular: - Source has less columns/nested fields than target => we fill with NULL or DEFAULT for inserts, and existing value for Update. (though we disabled for nested structs by default in [[SPARK-54525](https://issues.apache.org/jira/browse/SPARK-54525))](https://github.com/apache/spark/pull/53229) - Source has more columns/fields than target => we drop the extra fields. Initially, I thought its a good improvement of MERGE INTO and is not related to SCHEMA EVOLUTION exactly because the schema is not altered. But Anton has a good point that it may be a surprise to some user. So it may be better for now to be more conservative and keep the exact same behavior for without SCHEMA EVOLUTION clause. Note: this behavior is still enabled if SCHEMA EVOLUTION is specified, as the user then is more explicit about the decision. ### Does this PR introduce _any_ user-facing change? No, this keeps behavior exactly the same as 4.0 without SCHEMA EVOLUTION clause. ### How was this patch tested? Added a test and changed existing test output to expect the exception if SCHEMA EVOLUTION is not specified. ### Was this patch authored or co-authored using generative AI tooling? No Closes #53326 from szehon-ho/merge_restriction. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 74b6a93) Signed-off-by: Dongjoon Hyun <[email protected]>
|
My bad. I only checked the previous one and missed another failure. Let me revert this PR. |
|
Please ping me when you have a working PR once more, @szehon-ho . |
|
Ah, i think it conflicted with the concurrent pr. I will try to take a look |
|
Actually was able to fix the test quickly: #53363 thanks again @dongjoon-hyun ! |
What changes were proposed in this pull request?
Keep existing behavior for MERGE INTO without SCHEMA EVOLUTION clause for UPDATE SET * and INSERT * as well as UPDATE struct or INSERT struct, to throw exception if the source and target schemas are not exactly the same.
Why are the changes needed?
As @aokolnychyi tested this feature, he mentioned that as of Spark 4.1 the behavior is changed for MERGE INTO but without SCHEMA EVOLUTION clause.
In particular:
Initially, I thought its a good improvement of MERGE INTO and is not related to SCHEMA EVOLUTION exactly because the schema is not altered. But Anton has a good point that it may be a surprise to some user. So it may be better for now to be more conservative and keep the exact same behavior for without SCHEMA EVOLUTION clause.
Note: this behavior is still enabled if SCHEMA EVOLUTION is specified, as the user then is more explicit about the decision.
Does this PR introduce any user-facing change?
No, this keeps behavior exactly the same as 4.0 without SCHEMA EVOLUTION clause.
How was this patch tested?
Added a test and changed existing test output to expect the exception if SCHEMA EVOLUTION is not specified.
Was this patch authored or co-authored using generative AI tooling?
No