Conversation

@aokolnychyi (Contributor) commented Jan 15, 2025

What changes were proposed in this pull request?

This PR introduces conditional nullification of metadata columns in DELETE, UPDATE, and MERGE operations. Previously, connectors could project metadata columns in a row-level operation, but the metadata values were always preserved and could not be nullified. After this change, connectors control which metadata columns are preserved and when.

The new behavior is implemented via flags in `metadataInJSON` that are surfaced through `MetadataAttribute`. This PR also extends the existing `DataWriter` and `DeltaWriter` interfaces.
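
As a hedged illustration (the flag key below is a made-up stand-in, not the exact name this PR adds), a connector could attach such a flag to a column's metadata:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// Hypothetical flag key; the PR stores similar booleans in the column's
// metadata JSON so Spark knows which metadata values to keep per operation.
// A missing flag defaults to preserving the value, i.e. the old behavior.
val PreserveOnUpdate = "__preserve_on_update"

// _row_lineage_version must be re-assigned on every change, so the connector
// opts it out of preservation; _row_lineage_id simply keeps the default and
// survives updates unchanged.
val rowVersionMetadata = new MetadataBuilder()
  .putBoolean(PreserveOnUpdate, false)
  .build()

val rowVersionField =
  StructField("_row_lineage_version", StringType, nullable = true, rowVersionMetadata)
```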

Why are the changes needed?

These changes are essential to support row lineage in Iceberg and Delta Lake. Both projects define a row ID and a row version as part of their metadata concepts. The row ID is a synthetic metadata column that is null when a record is first inserted and is later assigned through inheritance. Once assigned, the row ID must remain constant and unaltered. In contrast, the row version is updated with every modification and must be re-assigned. The existing implementation of DELETE, UPDATE, and MERGE operations in Spark doesn't support the conditional metadata column nullification required for row lineage.

Suppose there is a table containing the following rows:

```
 dep |   name    | salary | _row_lineage_id | _row_lineage_version |     _file      | _pos
-----+-----------+--------+-----------------+----------------------+----------------+------
 hr  | Alice     |  200   | 101             | v1                   | fileA.parquet  | 0
 hr  | Robert    |  240   | 102             | v1                   | fileA.parquet  | 1
 it  | Charlie   |  260   | 103             | v1                   | fileA.parquet  | 2
 it  | Bob       |  220   | 104             | v1                   | fileA.parquet  | 3
```

Then `UPDATE t SET salary = salary + 10 WHERE dep = 'hr'` should produce:

```
 operation | row_id (_file, _pos)        | row (dep, name, salary)  | metadata (_row_lineage_id, _row_lineage_version)
-----------+-----------------------------+--------------------------+-------------------------------------------------
 update    | (fileA.parquet, 0)          | (hr, Alice, 210)         | (101, null)
 update    | (fileA.parquet, 1)          | (hr, Robert, 250)        | (102, null)
```

Note that `_row_lineage_id` values are preserved while `_row_lineage_version` values are nullified.
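
As a minimal sketch of the idea (assumed helper name, not the exact rewrite rule in this PR), the output projection for an UPDATE can pass a preserved metadata attribute through unchanged and replace a non-preserved one with a typed null:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}

// Sketch only: choose the output expression for a metadata column based on
// whether the connector flagged it as preserved for this operation.
def metadataOutput(attr: Attribute, preserved: Boolean): Expression =
  if (preserved) attr else Literal(null, attr.dataType)
```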

Does this PR introduce any user-facing change?

Yes, but the changes are backward compatible because the added flags default to the old behavior.

How was this patch tested?

This PR comes with unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jan 15, 2025
@aokolnychyi (Contributor Author):

Follows exactly what we have in Column.

Contributor:

shall we simply name the new write function in the base class as insert?

@aokolnychyi (Contributor Author) replied on Jan 15, 2025:

Actually, it may not always be a new record to insert in DataWriter. In group-based DELETE, UPDATE, and MERGE operations that replace entire files in Delta and Iceberg, certain records have to be copied over. That means those records aren't really inserts. Leaving the method name as write in DeltaWriter keeps its purpose fairly generic and allows us to use it beyond simple inserts.

@aokolnychyi (Contributor Author):

This may change too. Trying an idea.

@dongjoon-hyun (Member) left a comment:

Could you rebase this PR onto the master branch, @aokolnychyi? According to the CI failure, it seems to be affected by a bug that is already fixed in master. I hope it doesn't hide other real bugs.


```scala
import org.apache.spark.sql.catalyst.ProjectingInternalRow

case class ReplaceDataProjections(
```
@aokolnychyi (Contributor Author):

Similar to the `WriteDeltaProjections` we already have, but specific to `ReplaceData`.
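
Since the hunk above is truncated in the review view, here is a sketch of the plausible shape by analogy with `WriteDeltaProjections`; the field names are assumptions, not the definition from the PR:

```scala
import org.apache.spark.sql.catalyst.ProjectingInternalRow

// Assumed field names, by analogy with WriteDeltaProjections: ReplaceData
// rewrites whole groups of files, so it needs a projection for the data
// columns and, optionally, one for metadata columns that may be nullified.
case class ReplaceDataProjections(
    dataProjection: ProjectingInternalRow,
    metadataProjection: Option[ProjectingInternalRow])
```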

@cloud-fan (Contributor):

thanks, merging to master/4.0!

@cloud-fan closed this in 8313320 on Jan 22, 2025
cloud-fan pushed a commit that referenced this pull request Jan 22, 2025
…s in DML

Closes #49493 from aokolnychyi/spark-50820.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8313320)
Signed-off-by: Wenchen Fan <[email protected]>
@aokolnychyi (Contributor Author):

Thank you, @cloud-fan @dongjoon-hyun!

```scala
    colOrdinals: Seq[Int],
    attrs: Seq[Attribute]): ProjectingInternalRow = {
  val schema = StructType(attrs.zipWithIndex.map { case (attr, index) =>
    val nullable = outputs.exists(output => output(colOrdinals(index)).nullable)
```
Contributor:

@aokolnychyi Do we only need this for metadata columns? For regular columns, shall we use attr.nullable instead?
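
A sketch of what this suggestion could look like in the method above (hedged: it assumes `MetadataAttribute.isValid` is the appropriate detector and reuses the surrounding `outputs`, `colOrdinals`, and `index`):

```scala
import org.apache.spark.sql.catalyst.expressions.MetadataAttribute

// Sketch of the suggestion: only metadata columns can be conditionally
// nullified, so only they need nullability derived from the outputs;
// regular columns can trust the attribute's own nullability.
val nullable = if (MetadataAttribute.isValid(attr.metadata)) {
  outputs.exists(output => output(colOrdinals(index)).nullable)
} else {
  attr.nullable
}
```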

zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…s in DML

Closes apache#49493 from aokolnychyi/spark-50820.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d27389e)
Signed-off-by: Wenchen Fan <[email protected]>