[SPARK-50820][SQL] DSv2: Conditional nullification of metadata columns in DML #49493
Conversation
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
Follows exactly what we have in Column.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ProjectingInternalRow.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
shall we simply name the new write function in the base class as insert?
Actually, it may not always be a new record to insert in DataWriter. In group-based DELETE, UPDATE, and MERGE operations that replace entire files in Delta and Iceberg, certain records have to be copied over. That means those records aren't really inserts. Leaving the method name as write in DeltaWriter keeps its purpose fairly generic and allows us to use it beyond simple inserts.
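A minimal sketch of that point, assuming a hypothetical trait and helper rather than the actual `DataWriter`/`DeltaWriter` API: in a group-based UPDATE, the same writer receives rows it copies over unchanged and rows it modifies, so a generic `write` entry point fits both.

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Illustrative writer interface (not the actual DataWriter/DeltaWriter API):
// a single generic `write` covers both copied-over and modified rows.
trait RewriteFileWriter {
  def write(row: InternalRow): Unit
}

object GroupBasedUpdateSketch {
  // Sketch of a group-based UPDATE over one file: rows matching the condition
  // are rewritten with new values, non-matching rows are copied over unchanged.
  // Both paths go through the same `write` call, which is why the generic name
  // fits better than `insert`.
  def rewriteFile(
      rows: Iterator[InternalRow],
      matches: InternalRow => Boolean,
      applyUpdate: InternalRow => InternalRow,
      writer: RewriteFileWriter): Unit = {
    rows.foreach { row =>
      if (matches(row)) writer.write(applyUpdate(row)) // modified row
      else writer.write(row) // copied-over row, not an insert
    }
  }
}
```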
This may change too. Trying an idea.
dongjoon-hyun left a comment
Could you rebase this PR to the master branch, @aokolnychyi? According to the CI failure, it seems to be affected by a bug that is already fixed on the master branch. I hope it doesn't hide other real bugs.
`import org.apache.spark.sql.catalyst.ProjectingInternalRow`

`case class ReplaceDataProjections(`
Similar to WriteDeltaProjections we already have but specific to ReplaceData.
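For readers without the diff open, here is a rough sketch of the shape such a container could take; the field list is a guess for illustration, not the merged definition.

```scala
import org.apache.spark.sql.catalyst.ProjectingInternalRow

// Sketch only: the real field list may differ. Like WriteDeltaProjections,
// this simply bundles the projections a ReplaceData write needs to split an
// incoming row into its data part and its (possibly nullified) metadata part.
case class ReplaceDataProjections(
    rowProjection: ProjectingInternalRow,
    metadataProjection: Option[ProjectingInternalRow])
```

Keeping the metadata projection optional would match plans where the connector requests no metadata columns at all.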
thanks, merging to master/4.0!
[SPARK-50820][SQL] DSv2: Conditional nullification of metadata columns in DML

### What changes were proposed in this pull request?

This PR introduces conditional nullification of metadata columns in DELETE, UPDATE, and MERGE operations. Previously, connectors could project metadata columns in a row-level operation, but the metadata values were always preserved and could not be nullified. After this change, connectors control which metadata columns are preserved and when. The new behavior is implemented via flags in `metadataInJSON` exposed for `MetadataAttribute`. This PR also extends the existing `DataWriter` and `DeltaWriter` interfaces.

### Why are the changes needed?

These changes are essential to support row lineage in Iceberg and Delta Lake. Both projects define a row ID and a row version as part of their metadata concepts. The row ID is a synthetic metadata column that is null when a record is first inserted and becomes assigned through inheritance. Once assigned, the row ID must remain constant and unaltered. In contrast, the row version is updated with every modification and must be re-assigned. The existing implementation of DELETE, UPDATE, and MERGE operations in Spark doesn't support the conditional metadata column nullification required for row lineage.

Suppose there is a table containing the following rows:

```
dep | name    | salary | _row_lineage_id | _row_lineage_version | _file         | _pos
----+---------+--------+-----------------+----------------------+---------------+------
hr  | Alice   | 200    | 101             | v1                   | fileA.parquet | 0
hr  | Robert  | 240    | 102             | v1                   | fileA.parquet | 1
it  | Charlie | 260    | 103             | v1                   | fileA.parquet | 2
it  | Bob     | 220    | 104             | v1                   | fileA.parquet | 3
```

Then `UPDATE t SET salary = salary + 10 WHERE dep = 'hr'` should produce:

```
operation | row_id (_file, _pos) | row (dep, name, salary) | metadata (_row_lineage_id, _row_lineage_version)
----------+----------------------+-------------------------+--------------------------------------------------
update    | (fileA.parquet, 0)   | (hr, Alice, 210)        | (101, null)
update    | (fileA.parquet, 1)   | (hr, Robert, 250)       | (102, null)
```

Note that `_row_lineage_id` values are preserved but `_row_lineage_version` values are nullified.

### Does this PR introduce _any_ user-facing change?

Yes, but the changes are backward compatible due to default values for the added flags.

### How was this patch tested?

This PR comes with unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49493 from aokolnychyi/spark-50820.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8313320)
Signed-off-by: Wenchen Fan <[email protected]>
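The description above says the behavior is driven by flags carried in `metadataInJSON` for `MetadataAttribute`. As a rough illustration of what that means mechanically (the key name below is hypothetical, not the one the PR defines), such a flag is just an entry in the column metadata JSON that Spark already propagates with attributes:

```scala
import org.apache.spark.sql.types.{LongType, MetadataBuilder, StructField}

object MetadataFlagSketch {
  // Hypothetical key name; the actual key used by MetadataAttribute may differ.
  val PreserveOnUpdateKey = "__preserve_on_update"

  // A boolean flag placed into the column metadata, i.e. the JSON that
  // travels with the attribute through analysis and planning.
  val rowIdMetadata = new MetadataBuilder()
    .putBoolean(PreserveOnUpdateKey, true) // e.g. a row ID must survive UPDATE
    .build()

  val rowIdField = StructField("_row_lineage_id", LongType, nullable = true, rowIdMetadata)

  def main(args: Array[String]): Unit = {
    // Prints the metadata JSON, e.g. {"__preserve_on_update":true}
    println(rowIdField.metadata.json)
  }
}
```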
Thank you, @cloud-fan @dongjoon-hyun!
`colOrdinals: Seq[Int],`
`attrs: Seq[Attribute]): ProjectingInternalRow = {`
`val schema = StructType(attrs.zipWithIndex.map { case (attr, index) =>`
`val nullable = outputs.exists(output => output(colOrdinals(index)).nullable)`
@aokolnychyi Do we only need this for metadata columns? For regular columns, shall we use attr.nullable instead?
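For context on the question, a simplified, self-contained illustration of the two options (this is not the PR's actual helper): option A derives nullability as the union over all candidate outputs, as the snippet above does; option B trusts the attribute's own `nullable` flag.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.types.{StructField, StructType}

object ProjectionNullabilitySketch {
  // Option A (as in the snippet): a field is nullable if it is nullable in
  // any of the candidate outputs feeding the projection.
  def schemaFromOutputs(
      outputs: Seq[Seq[Attribute]],
      colOrdinals: Seq[Int],
      attrs: Seq[Attribute]): StructType = {
    StructType(attrs.zipWithIndex.map { case (attr, index) =>
      val nullable = outputs.exists(output => output(colOrdinals(index)).nullable)
      StructField(attr.name, attr.dataType, nullable, attr.metadata)
    })
  }

  // Option B (the review question): reuse the attribute's own nullability,
  // which may be sufficient for regular data columns.
  def schemaFromAttrs(attrs: Seq[Attribute]): StructType = {
    StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
  }
}
```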