
Fix table_changes incorrect results when querying cow tables #27827

Closed
chenjian2664 wants to merge 2 commits into trinodb:master from chenjian2664:jack/table-changes-cow-fix

Conversation

@chenjian2664
Contributor

@chenjian2664 chenjian2664 commented Jan 2, 2026

Description

Iceberg table_changes may return duplicate (incorrect) rows when querying tables written using the copy-on-write (CoW) update model.

In a CoW write path, an engine may update only a subset of rows within a data file. During this process, the unchanged rows from the original file are rewritten into a new data file together with the updated rows, while the original file is removed. Iceberg represents the removed file using DeletedDataFileScanTask.

However, DeletedDataFileScanTask does not differentiate between rows that are actually deleted or updated and rows that are merely rewritten due to the copy-on-write process. As a result, table_changes can incorrectly interpret unchanged rows as deleted, leading to incorrect change semantics and, in some cases, duplicate results.
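A minimal sketch of the carry-over problem, using the example data from this PR (illustrative only, not the connector's actual data structures):

```python
# A CoW update of the row (5, 'a') -> (5, 'updated') removes the old
# data file and writes a new one that also carries the unchanged row.
deleted = [(5, "a"), (4, "b")]         # rows of the removed data file
inserted = [(5, "updated"), (4, "b")]  # rows of the newly added data file

# Naively emitting every removed row as a delete and every added row as
# an insert reports the unchanged row as a spurious delete/insert pair:
carryover = set(deleted) & set(inserted)
# carryover == {(4, "b")} -- a row that was never logically changed
```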

Example of the failure

spark> create table t (x int, y varchar);
// snapshot 7527505804807355682
spark> insert into t values (5, 'a'), (4, 'b');

spark> insert into t values (5, 'a'), (4, 'b');

spark> update t set y = 'updated' where x = 5;
// snapshot 1538643479657750339
trino> select * from TABLE(system.table_changes(schema_name => 'default', table_name => 't', start_snapshot_id => 7527505804807355682, end_snapshot_id => 1538643479657750339)) order by _change_ordinal;

returns:
 x |    y    | _change_type | _change_version_id  |      _change_timestamp      | _change_ordinal
---+---------+--------------+---------------------+-----------------------------+-----------------
 5 | a       | insert       | 1710304824426268447 | 2025-12-22 08:56:03.496 UTC |               0
 4 | b       | insert       | 1710304824426268447 | 2025-12-22 08:56:03.496 UTC |               0
 5 | a       | insert       | 1219451281551863158 | 2025-12-22 08:56:14.544 UTC |               1
 4 | b       | insert       | 1219451281551863158 | 2025-12-22 08:56:14.544 UTC |               1
 5 | a       | delete       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 5 | a       | delete       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 4 | b       | delete       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 4 | b       | insert       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 4 | b       | delete       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 5 | updated | insert       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 5 | updated | insert       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2
 4 | b       | insert       | 1538643479657750339 | 2025-12-22 09:00:31.760 UTC |               2

Among the records with _change_ordinal = 2, the rows where x = 4 are duplicated, even though that row was only rewritten by the CoW update and never logically changed.

Workaround Approach

This PR introduces a minimal, planner-independent workaround to handle duplicates:
1. Splits are grouped by partition and change ordinal so that all changes from a single operation (update or delete) are captured together.
2. For CoW tables, the CopyOnWriteTableChangesFunctionProcessor counts row-level changes:
   * Insert → +1
   * Delete → -1
   The final count determines whether a row has an actual change; only rows with a non-zero net change are returned.
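The counting step above can be sketched as follows; the function name and data shapes are illustrative, not the actual Trino classes:

```python
from collections import Counter

def net_changes(rows):
    """Suppress carry-over rows by counting +1 per insert, -1 per delete.

    `rows` is an iterable of (change_type, row) pairs belonging to one
    operation, i.e. one partition / _change_ordinal group.
    """
    counts = Counter()
    for change_type, row in rows:
        counts[row] += 1 if change_type == "insert" else -1
    out = []
    for row, net in counts.items():
        if net > 0:                       # net insertions survive
            out.extend([("insert", row)] * net)
        elif net < 0:                     # net deletions survive
            out.extend([("delete", row)] * (-net))
        # net == 0: a carried-over row with no logical change; dropped
    return out

# The carried-over (4, 'b') nets to zero and is suppressed:
changes = net_changes([
    ("delete", (5, "a")), ("delete", (4, "b")),
    ("insert", (5, "updated")), ("insert", (4, "b")),
])
# changes == [("delete", (5, "a")), ("insert", (5, "updated"))]
```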

This approach is inspired by Spark's RemoveCarryoverIterator, which deduplicates changes by comparing adjacent rows. Unlike Spark, this implementation avoids post-scan sorting and does not require modifying the Trino planner.

Trade-offs:

  • Parallelism during scanning may be reduced, especially for unpartitioned tables.
  • Memory usage within the table function processor increases.

The changes are otherwise minimal, backward-compatible, and do not alter the behavior of the table_changes function for other tables.

This ensures that table_changes correctly reflects logical row-level changes for copy-on-write tables without introducing false deletions or duplicates.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Fix incorrect `table_changes` results when querying copy-on-write tables. ({issue}`27827`)

@cla-bot cla-bot bot added the cla-signed label Jan 2, 2026
@chenjian2664 chenjian2664 marked this pull request as draft January 2, 2026 13:51
@github-actions github-actions bot added the iceberg Iceberg connector label Jan 2, 2026
@chenjian2664 chenjian2664 force-pushed the jack/table-changes-cow-fix branch 5 times, most recently from 6f4980a to cd85dfa Compare January 5, 2026 09:46
@chenjian2664 chenjian2664 marked this pull request as ready for review January 5, 2026 09:58
this.icebergTable = requireNonNull(icebergTable, "table is null");
this.tableScan = requireNonNull(tableScan, "tableScan is null");
this.targetSplitSize = tableScan.targetSplitSize();
this.delegate = switch (rowLevelOperationMode(icebergTable)) {
Contributor
I think we need to consider the row-level operation mode per snapshot, not only for the latest snapshot of the table.

Contributor Author

It doesn't seem possible to check whether the files in each snapshot were written in CoW or MOR mode :(

Contributor Author

But I think you are right. We may want to include the write information, specifically the merge_mode, in the snapshot metadata. Perhaps we should propose this to the Iceberg community. What do you think?

Contributor Author
@chenjian2664 chenjian2664 Jan 7, 2026

It appears we are facing the following situation: we rely on the MERGE_MODE property to infer how files were written. However, some engines may ignore this property, which can lead to incorrect results.

Previously, the behavior was:

  • If the files were written in copy-on-write mode, the results were incorrect.
  • If the files were written in merge-on-read mode, the results were correct.

With this PR:

  • If the files were written in COW mode (even in some snapshots) but the table property MERGE_MODE is set to merge-on-read, incorrect results are produced.
  • In all other cases, the results are correct.
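The inference discussed here can be sketched as follows. The property keys are standard Iceberg table properties (which default to copy-on-write when unset), but the function itself is illustrative, not this PR's actual code; as noted above, these properties describe the table's current configuration, not how any particular snapshot's files were actually written.

```python
COPY_ON_WRITE = "copy-on-write"
MERGE_ON_READ = "merge-on-read"

def row_level_operation_mode(table_properties: dict) -> str:
    """Infer the row-level operation mode from Iceberg table properties."""
    # Iceberg defaults all three row-level operation modes to copy-on-write.
    modes = {
        table_properties.get("write.update.mode", COPY_ON_WRITE),
        table_properties.get("write.delete.mode", COPY_ON_WRITE),
        table_properties.get("write.merge.mode", COPY_ON_WRITE),
    }
    # Treat the table as CoW if any row-level operation uses it. An engine
    # that ignores these properties can still have written CoW files, which
    # is exactly the failure mode discussed in this thread.
    return COPY_ON_WRITE if COPY_ON_WRITE in modes else MERGE_ON_READ
```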

@chenjian2664 chenjian2664 force-pushed the jack/table-changes-cow-fix branch 2 times, most recently from 7b8dcb4 to db0d026 Compare January 6, 2026 06:54
@chenjian2664 chenjian2664 force-pushed the jack/table-changes-cow-fix branch from db0d026 to 0ff3495 Compare January 7, 2026 08:21
@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Jan 28, 2026
@chenjian2664
Contributor Author

Closing this now; the behavior for CoW tables isn't treated as a bug, though it is strange.

