Fix table_changes incorrect results when querying cow tables#27827
Fix table_changes incorrect results when querying cow tables#27827chenjian2664 wants to merge 2 commits intotrinodb:masterfrom
table_changes incorrect results when querying cow tables#27827Conversation
6f4980a to
cd85dfa
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergExecutorModule.java
Outdated
Show resolved
Hide resolved
| this.icebergTable = requireNonNull(icebergTable, "table is null"); | ||
| this.tableScan = requireNonNull(tableScan, "tableScan is null"); | ||
| this.targetSplitSize = tableScan.targetSplitSize(); | ||
| this.delegate = switch (rowLevelOperationMode(icebergTable)) { |
There was a problem hiding this comment.
i think we need to consider the row-level operation mode per each snapshot and not only for the latest snapshot of the table.
There was a problem hiding this comment.
It seems not possible to check files in per snapshot were written by cow or mor mode :(
There was a problem hiding this comment.
But I think you are right. We may want to include the write information, specifically the merge_mode, in the snapshot metadata. Perhaps we should propose this to the Iceberg community. What do you think?
There was a problem hiding this comment.
It appears we are facing the following situation: we rely on the MERGE_MODE property to infer how files were written. However, some engines may ignore this property, which can lead to incorrect results.
Previously, the behavior was:
- If the files were written in
copy-on-writemode, the results were incorrect. - If the files were written in
merge-on-readmode, the results were correct.
With this PR:
- If the files were written in COW mode (even in some snapshots) but the table property
MERGE_MODEis set tomerge-on-read, incorrect results are produced. - In all other cases, the results are correct.
7b8dcb4 to
db0d026
Compare
db0d026 to
0ff3495
Compare
|
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |
|
Let me close it now, the behavior for cow tables isn't treated as "bug" though it is strange |
Description
Iceberg
table_changesmay return duplicate(incorrect) rows when querying tables written using the copy-on-write (CoW) update model.In a CoW write path, an engine may update only a subset of rows within a data file. During this process, the unchanged rows from the original file are rewritten into a new data file together with the updated rows, while the original file is removed. Iceberg represents the removed file using
DeletedDataFileScanTask.However,
DeletedDataFileScanTaskdoes not differentiate between rows that are actually deleted or updated and rows that are merely rewritten due to the copy-on-write process. As a result,table_changescan incorrectly interpret unchanged rows as deleted, leading to incorrect change semantics and, in some cases, duplicate results.Example of failures
The
_change_ordinalequals 2 of the records thatxequals4are duplicatedWorkaround Approach
This PR introduces a minimal, planner-independent workaround to handle duplicates:
1. Splits are grouped by partition and change ordinality to capture all changes from a single operation (update or delete).
2. For CoW tables, the CopyOnWriteTableChangesFunctionProcessor counts row-level changes:
* Insert → +1
* Delete → -1
The final count determines whether the row has an actual change. Only rows with a net change are returned.
This approach is inspired by Spark's
RemoveCarryoverIterator, which deduplicates changes by comparing adjacent rows. Unlike Spark, this implementation avoids post-scan sorting and does not require modifying the Trino planner.Trade-offs:
table_changesprocedure.This ensures that
table_changescorrectly reflects logical row-level changes for copy-on-write tables without introducing false deletions or duplicates.Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: