Add Copy-on-Write support for Iceberg DELETE, UPDATE, and MERGE operations#27844
Add Copy-on-Write support for Iceberg DELETE, UPDATE, and MERGE operations#27844nookcreed wants to merge 6 commits intotrinodb:masterfrom
Conversation
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
db9c88a to
ec3f1e1
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
2 similar comments
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
1b20636 to
829bc66
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
1 similar comment
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
b133d54 to
2b50da5
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
2b50da5 to
23d8676
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
@nookcreed just check, have you signed & set cla yet? |
@chenjian2664 I have submitted the signed form. Waiting for approval. |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
1 similar comment
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
@chenjian2664 can you assign someone to review? |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
1 similar comment
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
If others are interested, they will chime in—this is an open-source project :) I briefly reviewed the test, and my understanding is that COW mainly controls the write behavior. For row-level changes (for example, deletes), it does not write delete files; instead, it writes new data files that contain the final results. |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
@chenjian2664 @nookcreed I am willing to chime in this, just testcases and other fixes are required here ? I can add on to this PR. |
5f5a453 to
6e304ca
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
|
|
||
| // Perform UPDATE operations | ||
| // Update single row | ||
| assertUpdate("UPDATE " + tableName + " SET name = 'Robert', value = 250 WHERE id = 2", 1); |
There was a problem hiding this comment.
After performing a row-level change operation, could you verify the resulting file changes?
Specifically, if some rows in data file f1 are updated, then under copy-on-write mode we should not see f1 in the table state anymore, because f1 is rewritten into a new data file. also, there should be no delete files associated with the original f1
you can use https://trino.io/docs/current/connector/iceberg.html#files-table to get the files info, so use it to compare the file info before & after the operation
There was a problem hiding this comment.
- Added a getDataFilePaths() helper that queries "table$files" WHERE content = 0 to get the set of data file paths before and after each operation
- For DELETE, UPDATE, and MERGE: assert that the file set after the operation has no overlap with the original files (i.e., the original data files are fully replaced, not just marked with deletes)
- Added explicit assertions that SELECT count(*) FROM "table$files" WHERE content != 0 returns 0, confirming no delete files are created in CoW mode
- Added a testCopyOnWriteVsMergeOnRead comparison test that verifies MoR creates delete files (content != 0) while CoW does not
6e304ca to
4332927
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
4332927 to
192a805
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |
|
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
192a805 to
5564e9f
Compare
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Varun Srinivas.
|
|
@chenjian2664 @kekwan PTAL |
|
@amoghmargoor @raunaqmorarka FYA as well |
…tions This commit implements Copy-on-Write (CoW) mode for Iceberg tables, providing users with explicit control over the read vs. write performance trade-off. Previously, only Merge-on-Read (MoR) was supported. Adds support for both Merge-on-Read and Copy-on-Write strategies: - **Merge-on-Read**: Fast writes, creates delete files (existing behavior) - **Copy-on-Write**: Fast reads, rewrites data files (new feature) Users can configure the write mode independently for DELETE, UPDATE, and MERGE operations using new table properties. - `write_delete_mode`: Controls DELETE behavior (default: 'merge-on-read') - `write_update_mode`: Controls UPDATE behavior (default: 'merge-on-read') - `write_merge_mode`: Controls MERGE behavior (default: 'merge-on-read') Each accepts: 'merge-on-read' or 'copy-on-write' - `UpdateMode`: Defines MERGE_ON_READ and COPY_ON_WRITE strategies - `UpdateKind`: Tracks operation type (DELETE, UPDATE, MERGE) **IcebergMetadata.java** (+827 lines): - `rewriteDataFilesForCowDelete()`: Main CoW implementation - Loads manifests to locate DataFile objects - Reads position deletes into Roaring64Bitmap - Filters data page-by-page, skipping deleted rows - Writes new data files with proper metrics - Uses Iceberg's RewriteFiles API for atomic commit - `readPositionDeletesFromDeleteFile()`: Efficient delete file reading - Enhanced `finishWrite()` to route CoW operations correctly - Schema evolution support with null checks **IcebergTableHandle.java**: - Added `UpdateKind` field to track operation type - Preserves operation context throughout query lifecycle **IcebergTableProperties.java**: - Registered new table properties - Default values ensure backward compatibility **Query Planning**: - `applyDelete()`: Routes DELETE to appropriate path (MoR vs CoW) - `beginMerge()`: Sets UpdateKind for MERGE operations Correct implementation of Iceberg's positional delete specification: 1. **DeleteManager.java**: Fixed type interface - Changed to `ImmutableLongBitmapDataProvider` for immutability - Prevents accidental bitmap mutation during processing 2. **IcebergMergeSink.java**: Fixed position delete writer interface - Added explicit cast to `ImmutableLongBitmapDataProvider` - Documented internal commit behavior to prevent double-commits 3. **IcebergPageSourceProvider.java**: Schema evolution support - Graceful handling of missing partition fields - Filters null column handles from evolved schemas 4. **PositionDeleteWriter.java**: Added getter for testing - Exposed `getWriter()` for test verification - Row positions are 0-based within each data file (Iceberg standard) - Uses `Roaring64Bitmap` for memory-efficient position tracking - Page-level filtering processes data incrementally - Correctly maintains current row position across all pages **Resource Management**: - Explicit rollback lifecycle for file writers - Try-catch-finally blocks ensure cleanup - Suppressed exceptions preserve original error context **Validation**: - Snapshot isolation using Iceberg's optimistic concurrency - Null checks for empty tables - Missing file validation before commit - Format version checks (requires v2 for row-level operations) **Error Handling**: - All IOExceptions wrapped in TrinoException - File paths included in error messages for debuggability - Clear error messages for common issues Comprehensive test suite with **25 test methods** across 3 classes (1,858 lines): **TestIcebergCopyOnWriteOperations** (19 tests, 952 lines): - Basic operations: DELETE, UPDATE, MERGE with CoW - Edge cases: empty tables, format v1 compatibility - Partitioning: single and cross-partition operations - Large batches: 1,000-5,000 row operations - Performance benchmarks and CoW vs MoR comparisons **TestIcebergCopyOnWriteIntegration** (5 tests, 642 lines): - Resource cleanup on failure - Concurrent operations - Snapshot isolation - Conflict detection and retry - Logging verification **TestIcebergCopyOnWriteDeleteOperations** (1 test, 264 lines): - Low-level delete operation testing Test coverage includes: - Basic CRUD operations with CoW mode - Partitioning and partition pruning - Format version compatibility - Large-scale operations (batch testing) - Concurrency, isolation, and conflict handling - Resource cleanup and error handling - Performance benchmarks **COPY_ON_WRITE.md** (+257 lines): - Quick start guide with SQL examples - Architecture overview and component descriptions - Position delete handling implementation details - Performance characteristics and trade-offs - When to use CoW vs MoR (decision guide) - Troubleshooting guide and best practices - Design rationale for user-specified mode selection All changes are backward compatible: - Default behavior unchanged (MERGE_ON_READ) - New table properties optional - Schema evolution handled gracefully - `UpdateKind` field in IcebergTableHandle is Optional - Format version validation protects v1 tables **Copy-on-Write**: - Fast reads (no merge overhead) - Slower writes (file rewriting) - High write amplification - Cleaner file structure **Merge-on-Read** (existing): - Fast writes (small delete files) - Slower reads (merge at query time) - Low write amplification - Small file accumulation ```sql -- Enable Copy-on-Write for DELETE operations ALTER TABLE my_table SET PROPERTIES write_delete_mode = 'copy-on-write'; -- Create table with CoW enabled CREATE TABLE events ( id BIGINT, event_time TIMESTAMP, data VARCHAR ) WITH ( format_version = 2, write_delete_mode = 'copy-on-write', write_update_mode = 'copy-on-write' ); ``` 26 files changed: 3,403 insertions(+), 93 deletions(-) **Core Implementation**: - IcebergMetadata.java (+827 lines) - IcebergTableHandle.java (+72 lines) - IcebergTableProperties.java (+39 lines) - IcebergUtil.java (+18 lines) - UpdateMode.java (new, 68 lines) - UpdateKind.java (new, 34 lines) **Position Delete Fixes**: - DeleteManager.java - IcebergMergeSink.java - IcebergPageSourceProvider.java - PositionDeleteWriter.java **Testing**: - TestIcebergCopyOnWriteOperations.java (new, 952 lines) - TestIcebergCopyOnWriteIntegration.java (new, 642 lines) - TestIcebergCopyOnWriteDeleteOperations.java (new, 264 lines) - 7 test files updated for compatibility **Documentation**: - COPY_ON_WRITE.md (new, 257 lines) Fixes: trinodb#26161 Fix tests Fix tests Add missing imports and use OrcReaderOptions/ParquetReaderOptions instead of null Remove unused schema parameter from rewriteDataFilesForCowDelete method Fix unused parameter warning in IcebergMetadata.java that was causing a compilation error with error-prone. The schema parameter was declared but never used in the method body. Remove unused queryRunner field from TestIcebergCopyOnWriteIntegration Fixes compiler warning for unused variable by removing: - Unused queryRunner field - Unnecessary setup() and tearDown() methods - Unused imports for BeforeAll and AfterAll annotations Fix orcReaderOptions null in BaseTrinoCatalogTest Add missing imports for OrcReaderOptions and ParquetReaderOptions and replace null values with proper instances in IcebergPageSourceProvider constructor calls. This fixes the test failures in all catalog test classes that extend BaseTrinoCatalogTest. Fix documentation
Address review feedback: assert that copy-on-write DELETE, UPDATE, and MERGE operations replace original data files with rewritten ones and produce no delete files, using Iceberg's $files system table.
Add file-level verification to testUpdateWithCopyOnWrite to confirm that original data files are replaced after update operations, matching the pattern already used in DELETE and MERGE tests.
5564e9f to
683e2bc
Compare
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Varun Srinivas.
|
Description
This change implements configurable copy-on-write (CoW) mode for row-level operations in Iceberg tables, giving users control over the read vs write performance trade-off.
Motivation:
Iceberg supports two approaches for row-level modifications:
Users with read-heavy workloads on frequently updated tables have been requesting CoW support to eliminate read-time overhead of merging delete files with data files. This is particularly valuable for:
Implementation:
This change adds three new table properties to control write mode per operation type:
Key changes:
Testing:
Documentation:
Limitations:
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: