Skip to content

Conversation

@szehon-ho
Copy link
Member

@szehon-ho szehon-ho commented Dec 4, 2025

What changes were proposed in this pull request?

Keep existing behavior for MERGE INTO without SCHEMA EVOLUTION clause for UPDATE SET * and INSERT * as well as UPDATE struct or INSERT struct, to throw exception if the source and target schemas are not exactly the same.

Why are the changes needed?

As @aokolnychyi tested this feature, he mentioned that as of Spark 4.1 the behavior is changed for MERGE INTO but without SCHEMA EVOLUTION clause.

In particular:

Initially, I thought its a good improvement of MERGE INTO and is not related to SCHEMA EVOLUTION exactly because the schema is not altered. But Anton has a good point that it may be a surprise to some user. So it may be better for now to be more conservative and keep the exact same behavior for without SCHEMA EVOLUTION clause.

Note: this behavior is still enabled if SCHEMA EVOLUTION is specified, as the user then is more explicit about the decision.

Does this PR introduce any user-facing change?

No, this keeps behavior exactly the same as 4.0 without SCHEMA EVOLUTION clause.

How was this patch tested?

Added a test and changed existing test output to expect the exception if SCHEMA EVOLUTION is not specified.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 4, 2025
@szehon-ho
Copy link
Member Author

HI @dongjoon-hyun fyi, @aokolnychyi tested this MERGE INTO WITH SCHEMA EVOLUTION feature more and had this feedback. Will try to get consensus today, but preparing the pr just in case, its a simple one.

The rationale for this pr is that its safer to keep behavior without SCHEMA EVOLUTION clause (where Spark used to throw exception), and we can relax it later, rather than the other way around.

Note its the case without SCHEMA EVOLUTION clause, it doesnt change the behavior of the new feature (with the SCHEMA EVOLUTION clause)

@dongjoon-hyun
Copy link
Member

Thank you for informing that, @szehon-ho .

@szehon-ho szehon-ho changed the title [SPARK-54595][SQL] Keep existing behavior without SCHEMA EVOLUTION clause [SPARK-54595][SQL] Keep existing behavior of MERGE INTO without SCHEMA EVOLUTION clause Dec 4, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the failure, @szehon-ho ?

- MERGE INTO TABLE - primary *** FAILED *** (19 milliseconds)
updateAssigns.apply(0).key.asInstanceOf[org.apache.spark.sql.catalyst.expressions.AttributeReference].sameRef(ts) was 

@dongjoon-hyun
Copy link
Member

Gentle ping, @szehon-ho .

@szehon-ho
Copy link
Member Author

ah let me take a look

@szehon-ho
Copy link
Member Author

Fixed it, thanks!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @szehon-ho .
Merged to master/4.1.

dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2025
…A EVOLUTION clause

### What changes were proposed in this pull request?
Keep existing behavior for MERGE INTO without SCHEMA EVOLUTION clause for UPDATE SET * and INSERT * as well as UPDATE struct or INSERT struct, to throw exception if the source and target schemas are not exactly the same.

### Why are the changes needed?
As aokolnychyi  tested this feature, he mentioned that as of Spark 4.1 the behavior is changed for MERGE INTO but without SCHEMA EVOLUTION clause.

In particular:

- Source has less columns/nested fields than target => we fill with NULL or DEFAULT for inserts, and existing value for Update. (though we disabled for nested structs by default in [[SPARK-54525](https://issues.apache.org/jira/browse/SPARK-54525))](https://github.com/apache/spark/pull/53229)
- Source has more columns/fields than target => we drop the extra fields.

Initially, I thought its a good improvement of MERGE INTO and is not related to SCHEMA EVOLUTION exactly because the schema is not altered. But Anton has a good point that it may be a surprise to some user. So it may be better for now to be more conservative and keep the exact same behavior for without SCHEMA EVOLUTION clause.

Note: this behavior is still enabled if SCHEMA EVOLUTION is specified, as the user then is more explicit about the decision.

### Does this PR introduce _any_ user-facing change?
No, this keeps behavior exactly the same as 4.0 without SCHEMA EVOLUTION clause.

### How was this patch tested?
Added a test and changed existing test output to expect the exception if SCHEMA EVOLUTION is not specified.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53326 from szehon-ho/merge_restriction.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 74b6a93)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

My bad. I only checked the previous one and missed another failure. Let me revert this PR.

- merge with struct missing nested field with check constraint *** FAILED ***
Expected exception org.apache.spark.SparkRuntimeException to be thrown, but org.apache.spark.sql.AnalysisException was thrown (MergeIntoTableSuiteBase.scala:5359)

@dongjoon-hyun
Copy link
Member

This is reverted

@dongjoon-hyun
Copy link
Member

Please ping me when you have a working PR once more, @szehon-ho .

@szehon-ho
Copy link
Member Author

szehon-ho commented Dec 6, 2025

Ah, i think it conflicted with the concurrent pr. I will try to take a look

@szehon-ho
Copy link
Member Author

Actually was able to fix the test quickly: #53363 thanks again @dongjoon-hyun !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants