Support copy-on-write change for Iceberg Connector #26161
Closed
SongChujun wants to merge 1 commit into trinodb:master
Conversation
Force-pushed from 32a071b to 62fc70b
Force-pushed from 62fc70b to b8cf821
Force-pushed from 8013ea3 to f558e4e
Member
The connector intentionally doesn’t support CoW. I recall that some maintainers were opposed to adding support for it. I recommend reaching an agreement before starting the review process.
Force-pushed from 317cadb to 3a38dd8
Force-pushed from 3a38dd8 to 6adddbe
Member
@pettyjamesm It was an offline discussion. cc: @dain @electrum
Contributor
How is this exposed to the user?
Force-pushed from 6adddbe to a1eaadc
Force-pushed from adf574b to 9dc60d3
Force-pushed from 12574ac to c873927
Force-pushed from 8ff875c to e1f02b5
Force-pushed from e1f02b5 to 47e1fd5
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.
Contributor
@ebyhr What were some of the reasons for not supporting CoW?
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Dec 23, 2025
…tions
This change implements configurable copy-on-write (CoW) mode for row-level
operations in Iceberg tables, giving users control over the read vs write
performance trade-off.
Motivation:
Iceberg supports two approaches for row-level modifications:
- Merge-on-Read (MoR): Fast writes, slower reads (current Trino behavior)
- Copy-on-Write (CoW): Slower writes, fast reads (previously unsupported)
Users with read-heavy workloads on frequently updated tables have been
requesting CoW support to eliminate read-time overhead of merging delete
files with data files. This is particularly valuable for:
- Analytics workloads with frequent small updates
- Tables with high read/write ratios
- Use cases requiring predictable read performance
Implementation:
This change adds three new table properties to control write mode per
operation type:
- `write_delete_mode`: 'merge-on-read' (default) or 'copy-on-write'
- `write_update_mode`: 'merge-on-read' (default) or 'copy-on-write'
- `write_merge_mode`: 'merge-on-read' (default) or 'copy-on-write'
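As a rough illustration of how a connector might resolve these per-operation properties (the actual logic lives in the Java `IcebergTableProperties`/`UpdateMode` code; the mapping and function names below are hypothetical), the resolution boils down to a lookup with a merge-on-read default:

```python
from enum import Enum

class UpdateKind(Enum):
    DELETE = "delete"
    UPDATE = "update"
    MERGE = "merge"

class UpdateMode(Enum):
    MERGE_ON_READ = "merge-on-read"
    COPY_ON_WRITE = "copy-on-write"

# Hypothetical mapping from operation kind to its table property name,
# mirroring the three properties described in this commit message.
PROPERTY_FOR_KIND = {
    UpdateKind.DELETE: "write_delete_mode",
    UpdateKind.UPDATE: "write_update_mode",
    UpdateKind.MERGE: "write_merge_mode",
}

def resolve_update_mode(table_properties: dict, kind: UpdateKind) -> UpdateMode:
    """Pick the write mode for an operation, defaulting to merge-on-read."""
    value = table_properties.get(PROPERTY_FOR_KIND[kind], "merge-on-read")
    return UpdateMode(value)
```

With an empty property map every operation stays on merge-on-read, which is what makes the feature backward compatible.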
Key changes:
1. **UpdateKind tracking**: Added UpdateKind enum (DELETE/UPDATE/MERGE) to
IcebergTableHandle to track operation type throughout query lifecycle
- `applyDelete()` sets UpdateKind.DELETE
- `getUpdateLayout()` sets UpdateKind.UPDATE
- `beginMerge()` sets UpdateKind.MERGE
2. **UpdateMode resolution**: Added UpdateMode enum (MERGE_ON_READ/COPY_ON_WRITE)
to resolve write mode from table properties based on UpdateKind
3. **CoW DELETE implementation**: Modified `finishWrite()` to detect CoW DELETE
operations and delegate to new `rewriteDataFilesForCowDelete()` method
- Reads manifests to locate DataFile objects for files to delete
- For each affected data file:
* Reads position deletes from delete files into Roaring64Bitmap
* Opens original data file via IcebergPageSourceProvider
* Filters deleted positions page-by-page using Block.copyPositions()
* Writes filtered data to new file with proper metrics
* Manages rollback lifecycle for cleanup on errors
- Uses Iceberg's RewriteFiles API for atomic commit
4. **Resource management**: Proper handling of rollback lifecycle
- Rollback handle kept until file successfully added to transaction
- Cleanup guaranteed via try-catch-finally blocks
- Suppressed exceptions preserve original error context
5. **Optimizations**:
- Direct manifest reading with early termination when all files found
- Efficient position delete filtering using Roaring64Bitmap
- Reuses existing PositionDeleteFilter implementation for consistency
6. **Safety features**:
- Snapshot isolation via Iceberg's validation mechanisms
- Null checks for empty tables
- Missing file validation before commit
- Format version check (requires v2 for row-level operations)
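The heart of the CoW DELETE path in item 3 is position filtering with a running row offset across pages. Here is a minimal Python sketch of that idea, assuming an in-memory page model; the real Java code uses `Block.copyPositions()` and a `Roaring64Bitmap` (a plain set stands in for the bitmap here):

```python
def rewrite_pages_dropping_deletes(pages, deleted_positions):
    """Sketch of CoW DELETE filtering: copy every page of a data file,
    skipping rows whose file-relative 0-based position is marked deleted.

    pages: list of pages, each a list of rows (any objects)
    deleted_positions: set of file-relative row positions to drop
    """
    filtered_pages = []
    position = 0  # running row position across all pages of the file
    for page in pages:
        kept = [row for i, row in enumerate(page)
                if (position + i) not in deleted_positions]
        position += len(page)  # advance by the page's original size
        filtered_pages.append(kept)
    return filtered_pages
```

The key detail the commit message calls out is that the position counter advances by the *original* page size, not the filtered size, so positions stay aligned with the delete file's coordinates.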
Testing:
- Added TestIcebergCopyOnWriteOperations with 17 comprehensive tests
* Basic operations (DELETE, UPDATE, MERGE)
* Partitioned tables
* Large batch operations (1000-5000 rows)
* Performance benchmarks with metrics collection
* Error cases (empty tables, format v1, etc.)
- Added TestIcebergCopyOnWriteDeleteOperations with 15 unit tests
* File rewriting with position deletes
* Error handling (IO exceptions, missing files, etc.)
* Resource cleanup verification
* Edge cases (empty deletes, equality deletes, etc.)
- Added TestIcebergCopyOnWriteIntegration with 5 integration tests
* Resource cleanup on failure
* Concurrent operations
* Snapshot isolation
* Conflict resolution
- Updated 8 test files to reflect new IcebergMetadata constructor signature
Documentation:
- Added comprehensive COPY_ON_WRITE_README.md with:
* Feature overview and motivation
* Configuration examples
* Usage patterns and recommendations
* Performance characteristics (CoW vs MoR comparison)
* Implementation deep dive with code examples
* Troubleshooting guide
* Known limitations and future enhancements
Limitations:
- CoW DELETE currently only supports position deletes
- Equality deletes documented as future enhancement
- Requires Iceberg format version 2 or higher
Fixes: trinodb#26161
Addresses design considerations from: trinodb#17272
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Dec 23, 2025
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Dec 30, 2025
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Dec 30, 2025
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 4, 2026
…tions

This commit implements Copy-on-Write (CoW) mode for Iceberg tables, providing users with explicit control over the read vs. write performance trade-off. Previously, only Merge-on-Read (MoR) was supported.

## Overview
Adds support for both Merge-on-Read and Copy-on-Write strategies:
- **Merge-on-Read**: Fast writes, creates delete files (existing behavior)
- **Copy-on-Write**: Fast reads, rewrites data files (new feature)

Users can configure the write mode independently for DELETE, UPDATE, and MERGE operations using new table properties.

## New Features

### Table Properties
- `write_delete_mode`: Controls DELETE behavior (default: 'merge-on-read')
- `write_update_mode`: Controls UPDATE behavior (default: 'merge-on-read')
- `write_merge_mode`: Controls MERGE behavior (default: 'merge-on-read')

Each accepts: 'merge-on-read' or 'copy-on-write'

### New Enums
- `UpdateMode`: Defines MERGE_ON_READ and COPY_ON_WRITE strategies
- `UpdateKind`: Tracks operation type (DELETE, UPDATE, MERGE)

### Core Implementation
**IcebergMetadata.java** (+827 lines):
- `rewriteDataFilesForCowDelete()`: Main CoW implementation
  - Loads manifests to locate DataFile objects
  - Reads position deletes into Roaring64Bitmap
  - Filters data page-by-page, skipping deleted rows
  - Writes new data files with proper metrics
  - Uses Iceberg's RewriteFiles API for atomic commit
- `readPositionDeletesFromDeleteFile()`: Efficient delete file reading
- Enhanced `finishWrite()` to route CoW operations correctly
- Schema evolution support with null checks

**IcebergTableHandle.java**:
- Added `UpdateKind` field to track operation type
- Preserves operation context throughout query lifecycle

**IcebergTableProperties.java**:
- Registered new table properties
- Default values ensure backward compatibility

**Query Planning**:
- `applyDelete()`: Routes DELETE to appropriate path (MoR vs CoW)
- `beginMerge()`: Sets UpdateKind for MERGE operations

## Position Delete Handling
Correct implementation of Iceberg's positional delete specification:
1. **DeleteManager.java**: Fixed type interface
   - Changed to `ImmutableLongBitmapDataProvider` for immutability
   - Prevents accidental bitmap mutation during processing
2. **IcebergMergeSink.java**: Fixed position delete writer interface
   - Added explicit cast to `ImmutableLongBitmapDataProvider`
   - Documented internal commit behavior to prevent double-commits
3. **IcebergPageSourceProvider.java**: Schema evolution support
   - Graceful handling of missing partition fields
   - Filters null column handles from evolved schemas
4. **PositionDeleteWriter.java**: Added getter for testing
   - Exposed `getWriter()` for test verification

### Key Implementation Details
- Row positions are 0-based within each data file (Iceberg standard)
- Uses `Roaring64Bitmap` for memory-efficient position tracking
- Page-level filtering processes data incrementally
- Correctly maintains current row position across all pages

## Safety and Reliability
**Resource Management**:
- Explicit rollback lifecycle for file writers
- Try-catch-finally blocks ensure cleanup
- Suppressed exceptions preserve original error context

**Validation**:
- Snapshot isolation using Iceberg's optimistic concurrency
- Null checks for empty tables
- Missing file validation before commit
- Format version checks (requires v2 for row-level operations)

**Error Handling**:
- All IOExceptions wrapped in TrinoException
- File paths included in error messages for debuggability
- Clear error messages for common issues

## Testing
Comprehensive test suite with **25 test methods** across 3 classes (1,858 lines):

**TestIcebergCopyOnWriteOperations** (19 tests, 952 lines):
- Basic operations: DELETE, UPDATE, MERGE with CoW
- Edge cases: empty tables, format v1 compatibility
- Partitioning: single and cross-partition operations
- Large batches: 1,000-5,000 row operations
- Performance benchmarks and CoW vs MoR comparisons

**TestIcebergCopyOnWriteIntegration** (5 tests, 642 lines):
- Resource cleanup on failure
- Concurrent operations
- Snapshot isolation
- Conflict detection and retry
- Logging verification

**TestIcebergCopyOnWriteDeleteOperations** (1 test, 264 lines):
- Low-level delete operation testing

Test coverage includes:
- Basic CRUD operations with CoW mode
- Partitioning and partition pruning
- Format version compatibility
- Large-scale operations (batch testing)
- Concurrency, isolation, and conflict handling
- Resource cleanup and error handling
- Performance benchmarks

## Documentation
**COPY_ON_WRITE.md** (+257 lines):
- Quick start guide with SQL examples
- Architecture overview and component descriptions
- Position delete handling implementation details
- Performance characteristics and trade-offs
- When to use CoW vs MoR (decision guide)
- Troubleshooting guide and best practices
- Design rationale for user-specified mode selection

## Backward Compatibility
All changes are backward compatible:
- Default behavior unchanged (MERGE_ON_READ)
- New table properties optional
- Schema evolution handled gracefully
- `UpdateKind` field in IcebergTableHandle is Optional
- Format version validation protects v1 tables

## Performance Characteristics
**Copy-on-Write**:
- Fast reads (no merge overhead)
- Slower writes (file rewriting)
- High write amplification
- Cleaner file structure

**Merge-on-Read** (existing):
- Fast writes (small delete files)
- Slower reads (merge at query time)
- Low write amplification
- Small file accumulation

## Usage Example
```sql
-- Enable Copy-on-Write for DELETE operations
ALTER TABLE my_table SET PROPERTIES write_delete_mode = 'copy-on-write';

-- Create table with CoW enabled
CREATE TABLE events (
    id BIGINT,
    event_time TIMESTAMP,
    data VARCHAR
)
WITH (
    format_version = 2,
    write_delete_mode = 'copy-on-write',
    write_update_mode = 'copy-on-write'
);
```

## Files Changed
26 files changed: 3,403 insertions(+), 93 deletions(-)

**Core Implementation**:
- IcebergMetadata.java (+827 lines)
- IcebergTableHandle.java (+72 lines)
- IcebergTableProperties.java (+39 lines)
- IcebergUtil.java (+18 lines)
- UpdateMode.java (new, 68 lines)
- UpdateKind.java (new, 34 lines)

**Position Delete Fixes**:
- DeleteManager.java
- IcebergMergeSink.java
- IcebergPageSourceProvider.java
- PositionDeleteWriter.java

**Testing**:
- TestIcebergCopyOnWriteOperations.java (new, 952 lines)
- TestIcebergCopyOnWriteIntegration.java (new, 642 lines)
- TestIcebergCopyOnWriteDeleteOperations.java (new, 264 lines)
- 7 test files updated for compatibility

**Documentation**:
- COPY_ON_WRITE.md (new, 257 lines)

Fixes: trinodb#26161
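The rollback lifecycle described under Resource Management can be sketched as follows. This is an illustrative Python analogue, not the Java implementation: `write_with_rollback` and the writer methods are hypothetical names, and `raise ... from ...` stands in for Java's `Throwable.addSuppressed` so the original error stays primary when cleanup also fails:

```python
def write_with_rollback(writer, pages, commit):
    """Write a replacement data file; roll it back unless the
    transaction takes ownership.

    writer: object with append(page), close() -> file handle, rollback()
    commit: callable handed the new file on success
    """
    try:
        for page in pages:
            writer.append(page)
        new_file = writer.close()
    except Exception as error:
        try:
            writer.rollback()  # delete the partially written file
        except Exception as cleanup_error:
            # Keep the original error primary; chain the cleanup
            # failure as context (analogue of addSuppressed).
            raise error from cleanup_error
        raise
    commit(new_file)  # only now does the transaction own the file
    return new_file
```

The design point mirrored here is that the rollback action stays armed until the file has been handed to the transaction, so a failure at any earlier step cannot leak an orphaned file.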
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 4, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 5, 2026
…tions

This commit implements Copy-on-Write (CoW) mode for Iceberg tables, providing users with explicit control over the read vs. write performance trade-off. Previously, only Merge-on-Read (MoR) was supported.

Adds support for both Merge-on-Read and Copy-on-Write strategies:

- **Merge-on-Read**: Fast writes, creates delete files (existing behavior)
- **Copy-on-Write**: Fast reads, rewrites data files (new feature)

Users can configure the write mode independently for DELETE, UPDATE, and MERGE operations using new table properties:

- `write_delete_mode`: Controls DELETE behavior (default: 'merge-on-read')
- `write_update_mode`: Controls UPDATE behavior (default: 'merge-on-read')
- `write_merge_mode`: Controls MERGE behavior (default: 'merge-on-read')

Each accepts 'merge-on-read' or 'copy-on-write'.

- `UpdateMode`: Defines MERGE_ON_READ and COPY_ON_WRITE strategies
- `UpdateKind`: Tracks operation type (DELETE, UPDATE, MERGE)

**IcebergMetadata.java** (+827 lines):

- `rewriteDataFilesForCowDelete()`: Main CoW implementation
  - Loads manifests to locate DataFile objects
  - Reads position deletes into Roaring64Bitmap
  - Filters data page-by-page, skipping deleted rows
  - Writes new data files with proper metrics
  - Uses Iceberg's RewriteFiles API for atomic commit
- `readPositionDeletesFromDeleteFile()`: Efficient delete file reading
- Enhanced `finishWrite()` to route CoW operations correctly
- Schema evolution support with null checks

**IcebergTableHandle.java**:

- Added `UpdateKind` field to track operation type
- Preserves operation context throughout query lifecycle

**IcebergTableProperties.java**:

- Registered new table properties
- Default values ensure backward compatibility

**Query Planning**:

- `applyDelete()`: Routes DELETE to appropriate path (MoR vs CoW)
- `beginMerge()`: Sets UpdateKind for MERGE operations

Correct implementation of Iceberg's positional delete specification:

1. **DeleteManager.java**: Fixed type interface
   - Changed to `ImmutableLongBitmapDataProvider` for immutability
   - Prevents accidental bitmap mutation during processing
2. **IcebergMergeSink.java**: Fixed position delete writer interface
   - Added explicit cast to `ImmutableLongBitmapDataProvider`
   - Documented internal commit behavior to prevent double-commits
3. **IcebergPageSourceProvider.java**: Schema evolution support
   - Graceful handling of missing partition fields
   - Filters null column handles from evolved schemas
4. **PositionDeleteWriter.java**: Added getter for testing
   - Exposed `getWriter()` for test verification

- Row positions are 0-based within each data file (Iceberg standard)
- Uses `Roaring64Bitmap` for memory-efficient position tracking
- Page-level filtering processes data incrementally
- Correctly maintains current row position across all pages

**Resource Management**:

- Explicit rollback lifecycle for file writers
- Try-catch-finally blocks ensure cleanup
- Suppressed exceptions preserve original error context

**Validation**:

- Snapshot isolation using Iceberg's optimistic concurrency
- Null checks for empty tables
- Missing file validation before commit
- Format version checks (requires v2 for row-level operations)

**Error Handling**:

- All IOExceptions wrapped in TrinoException
- File paths included in error messages for debuggability
- Clear error messages for common issues

Comprehensive test suite with **25 test methods** across 3 classes (1,858 lines):

**TestIcebergCopyOnWriteOperations** (19 tests, 952 lines):

- Basic operations: DELETE, UPDATE, MERGE with CoW
- Edge cases: empty tables, format v1 compatibility
- Partitioning: single and cross-partition operations
- Large batches: 1,000-5,000 row operations
- Performance benchmarks and CoW vs MoR comparisons

**TestIcebergCopyOnWriteIntegration** (5 tests, 642 lines):

- Resource cleanup on failure
- Concurrent operations
- Snapshot isolation
- Conflict detection and retry
- Logging verification

**TestIcebergCopyOnWriteDeleteOperations** (1 test, 264 lines):

- Low-level delete operation testing

Test coverage includes:

- Basic CRUD operations with CoW mode
- Partitioning and partition pruning
- Format version compatibility
- Large-scale operations (batch testing)
- Concurrency, isolation, and conflict handling
- Resource cleanup and error handling
- Performance benchmarks

**COPY_ON_WRITE.md** (+257 lines):

- Quick start guide with SQL examples
- Architecture overview and component descriptions
- Position delete handling implementation details
- Performance characteristics and trade-offs
- When to use CoW vs MoR (decision guide)
- Troubleshooting guide and best practices
- Design rationale for user-specified mode selection

All changes are backward compatible:

- Default behavior unchanged (MERGE_ON_READ)
- New table properties optional
- Schema evolution handled gracefully
- `UpdateKind` field in IcebergTableHandle is Optional
- Format version validation protects v1 tables

**Copy-on-Write**:

- Fast reads (no merge overhead)
- Slower writes (file rewriting)
- High write amplification
- Cleaner file structure

**Merge-on-Read** (existing):

- Fast writes (small delete files)
- Slower reads (merge at query time)
- Low write amplification
- Small file accumulation

```sql
-- Enable Copy-on-Write for DELETE operations
ALTER TABLE my_table SET PROPERTIES write_delete_mode = 'copy-on-write';

-- Create table with CoW enabled
CREATE TABLE events (
    id BIGINT,
    event_time TIMESTAMP,
    data VARCHAR
)
WITH (
    format_version = 2,
    write_delete_mode = 'copy-on-write',
    write_update_mode = 'copy-on-write'
);
```

26 files changed: 3,403 insertions(+), 93 deletions(-)

**Core Implementation**:

- IcebergMetadata.java (+827 lines)
- IcebergTableHandle.java (+72 lines)
- IcebergTableProperties.java (+39 lines)
- IcebergUtil.java (+18 lines)
- UpdateMode.java (new, 68 lines)
- UpdateKind.java (new, 34 lines)

**Position Delete Fixes**:

- DeleteManager.java
- IcebergMergeSink.java
- IcebergPageSourceProvider.java
- PositionDeleteWriter.java

**Testing**:

- TestIcebergCopyOnWriteOperations.java (new, 952 lines)
- TestIcebergCopyOnWriteIntegration.java (new, 642 lines)
- TestIcebergCopyOnWriteDeleteOperations.java (new, 264 lines)
- 7 test files updated for compatibility

**Documentation**:

- COPY_ON_WRITE.md (new, 257 lines)

Fixes: trinodb#26161

Fix tests

Fix tests

Add missing imports and use OrcReaderOptions/ParquetReaderOptions instead of null
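The position-delete handling described above (read deleted row positions into a bitmap, then rewrite each data file page by page while a running 0-based position counter advances across page boundaries) can be sketched as follows. This is a minimal, self-contained illustration, not the Trino implementation: `CowPageFilterSketch` and `filterPages` are hypothetical names, and a plain `Set<Long>` stands in for the `Roaring64Bitmap` used in the actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of CoW delete filtering: walk the data file's pages in order,
// keep one global 0-based row position across all pages, and drop any
// row whose position appears in the position-delete set.
public class CowPageFilterSketch
{
    public static <T> List<T> filterPages(List<List<T>> pages, Set<Long> deletedPositions)
    {
        List<T> kept = new ArrayList<>();
        long position = 0; // row position within the whole data file, not the page
        for (List<T> page : pages) {
            for (T row : page) {
                if (!deletedPositions.contains(position)) {
                    kept.add(row);
                }
                position++; // advance for deleted rows too, or later positions shift
            }
        }
        return kept;
    }
}
```

For pages `[["a","b","c"], ["d","e"]]` with deleted positions `{1, 3}`, rows `b` (position 1) and `d` (position 3) are skipped; resetting the counter at each page boundary would delete the wrong rows, which is why the message stresses maintaining the current row position across all pages.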
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 5, 2026
…tions

Same commit message as the previous push, with these notes appended:

Remove unused schema parameter from rewriteDataFilesForCowDelete method

Fix unused parameter warning in IcebergMetadata.java that was causing a compilation error with error-prone. The schema parameter was declared but never used in the method body.

Remove unused queryRunner field from TestIcebergCopyOnWriteIntegration

Fixes compiler warning for unused variable by removing:

- Unused queryRunner field
- Unnecessary setup() and tearDown() methods
- Unused imports for BeforeAll and AfterAll annotations

Fix orcReaderOptions null in BaseTrinoCatalogTest

Add missing imports for OrcReaderOptions and ParquetReaderOptions and replace null values with proper instances in IcebergPageSourceProvider constructor calls. This fixes the test failures in all catalog test classes that extend BaseTrinoCatalogTest.

Fix documentation
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 5, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 5, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request on Jan 16, 2026
…tions This commit implements Copy-on-Write (CoW) mode for Iceberg tables, providing users with explicit control over the read vs. write performance trade-off. Previously, only Merge-on-Read (MoR) was supported. Adds support for both Merge-on-Read and Copy-on-Write strategies: - **Merge-on-Read**: Fast writes, creates delete files (existing behavior) - **Copy-on-Write**: Fast reads, rewrites data files (new feature) Users can configure the write mode independently for DELETE, UPDATE, and MERGE operations using new table properties. - `write_delete_mode`: Controls DELETE behavior (default: 'merge-on-read') - `write_update_mode`: Controls UPDATE behavior (default: 'merge-on-read') - `write_merge_mode`: Controls MERGE behavior (default: 'merge-on-read') Each accepts: 'merge-on-read' or 'copy-on-write' - `UpdateMode`: Defines MERGE_ON_READ and COPY_ON_WRITE strategies - `UpdateKind`: Tracks operation type (DELETE, UPDATE, MERGE) **IcebergMetadata.java** (+827 lines): - `rewriteDataFilesForCowDelete()`: Main CoW implementation - Loads manifests to locate DataFile objects - Reads position deletes into Roaring64Bitmap - Filters data page-by-page, skipping deleted rows - Writes new data files with proper metrics - Uses Iceberg's RewriteFiles API for atomic commit - `readPositionDeletesFromDeleteFile()`: Efficient delete file reading - Enhanced `finishWrite()` to route CoW operations correctly - Schema evolution support with null checks **IcebergTableHandle.java**: - Added `UpdateKind` field to track operation type - Preserves operation context throughout query lifecycle **IcebergTableProperties.java**: - Registered new table properties - Default values ensure backward compatibility **Query Planning**: - `applyDelete()`: Routes DELETE to appropriate path (MoR vs CoW) - `beginMerge()`: Sets UpdateKind for MERGE operations Correct implementation of Iceberg's positional delete specification: 1. 
1. **DeleteManager.java**: Fixed type interface
   - Changed to `ImmutableLongBitmapDataProvider` for immutability
   - Prevents accidental bitmap mutation during processing
2. **IcebergMergeSink.java**: Fixed position delete writer interface
   - Added explicit cast to `ImmutableLongBitmapDataProvider`
   - Documented internal commit behavior to prevent double-commits
3. **IcebergPageSourceProvider.java**: Schema evolution support
   - Graceful handling of missing partition fields
   - Filters null column handles from evolved schemas
4. **PositionDeleteWriter.java**: Added getter for testing
   - Exposed `getWriter()` for test verification

- Row positions are 0-based within each data file (Iceberg standard)
- Uses `Roaring64Bitmap` for memory-efficient position tracking
- Page-level filtering processes data incrementally
- Correctly maintains current row position across all pages

**Resource Management**:
- Explicit rollback lifecycle for file writers
- Try-catch-finally blocks ensure cleanup
- Suppressed exceptions preserve original error context

**Validation**:
- Snapshot isolation using Iceberg's optimistic concurrency
- Null checks for empty tables
- Missing file validation before commit
- Format version checks (requires v2 for row-level operations)

**Error Handling**:
- All IOExceptions wrapped in TrinoException
- File paths included in error messages for debuggability
- Clear error messages for common issues

Comprehensive test suite with **25 test methods** across 3 classes (1,858 lines):

**TestIcebergCopyOnWriteOperations** (19 tests, 952 lines):
- Basic operations: DELETE, UPDATE, MERGE with CoW
- Edge cases: empty tables, format v1 compatibility
- Partitioning: single and cross-partition operations
- Large batches: 1,000-5,000 row operations
- Performance benchmarks and CoW vs MoR comparisons

**TestIcebergCopyOnWriteIntegration** (5 tests, 642 lines):
- Resource cleanup on failure
- Concurrent operations
- Snapshot isolation
- Conflict detection and retry
- Logging verification

**TestIcebergCopyOnWriteDeleteOperations** (1 test, 264 lines):
- Low-level delete operation testing

Test coverage includes:
- Basic CRUD operations with CoW mode
- Partitioning and partition pruning
- Format version compatibility
- Large-scale operations (batch testing)
- Concurrency, isolation, and conflict handling
- Resource cleanup and error handling
- Performance benchmarks

**COPY_ON_WRITE.md** (+257 lines):
- Quick start guide with SQL examples
- Architecture overview and component descriptions
- Position delete handling implementation details
- Performance characteristics and trade-offs
- When to use CoW vs MoR (decision guide)
- Troubleshooting guide and best practices
- Design rationale for user-specified mode selection

All changes are backward compatible:
- Default behavior unchanged (MERGE_ON_READ)
- New table properties optional
- Schema evolution handled gracefully
- `UpdateKind` field in IcebergTableHandle is Optional
- Format version validation protects v1 tables

**Copy-on-Write**:
- Fast reads (no merge overhead)
- Slower writes (file rewriting)
- High write amplification
- Cleaner file structure

**Merge-on-Read** (existing):
- Fast writes (small delete files)
- Slower reads (merge at query time)
- Low write amplification
- Small file accumulation

```sql
-- Enable Copy-on-Write for DELETE operations
ALTER TABLE my_table SET PROPERTIES write_delete_mode = 'copy-on-write';

-- Create table with CoW enabled
CREATE TABLE events (
    id BIGINT,
    event_time TIMESTAMP,
    data VARCHAR
)
WITH (
    format_version = 2,
    write_delete_mode = 'copy-on-write',
    write_update_mode = 'copy-on-write'
);
```

26 files changed: 3,403 insertions(+), 93 deletions(-)

**Core Implementation**:
- IcebergMetadata.java (+827 lines)
- IcebergTableHandle.java (+72 lines)
- IcebergTableProperties.java (+39 lines)
- IcebergUtil.java (+18 lines)
- UpdateMode.java (new, 68 lines)
- UpdateKind.java (new, 34 lines)

**Position Delete Fixes**:
- DeleteManager.java
- IcebergMergeSink.java
- IcebergPageSourceProvider.java
- PositionDeleteWriter.java

**Testing**:
- TestIcebergCopyOnWriteOperations.java (new, 952 lines)
- TestIcebergCopyOnWriteIntegration.java (new, 642 lines)
- TestIcebergCopyOnWriteDeleteOperations.java (new, 264 lines)
- 7 test files updated for compatibility

**Documentation**:
- COPY_ON_WRITE.md (new, 257 lines)

Fixes: trinodb#26161

Fix tests

Fix tests

Add missing imports and use OrcReaderOptions/ParquetReaderOptions instead of null

Remove unused schema parameter from rewriteDataFilesForCowDelete method

Fix unused parameter warning in IcebergMetadata.java that was causing a compilation error with error-prone. The schema parameter was declared but never used in the method body.

Remove unused queryRunner field from TestIcebergCopyOnWriteIntegration

Fixes compiler warning for unused variable by removing:
- Unused queryRunner field
- Unnecessary setup() and tearDown() methods
- Unused imports for BeforeAll and AfterAll annotations

Fix orcReaderOptions null in BaseTrinoCatalogTest

Add missing imports for OrcReaderOptions and ParquetReaderOptions and replace null values with proper instances in IcebergPageSourceProvider constructor calls. This fixes the test failures in all catalog test classes that extend BaseTrinoCatalogTest.

Fix documentation
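The page-level filtering described in the commit message (a `Roaring64Bitmap` of deleted positions, with a running row position maintained across all pages of a file) can be sketched as follows. This is a minimal illustration, not Trino's actual code: a sorted `Set<Long>` stands in for `Roaring64Bitmap`, and `List<String>` pages stand in for Trino's `Page` abstraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CowRewriteSketch
{
    // Rewrite one data file by dropping rows whose file-relative positions
    // appear in the position-delete set, processing the file page by page.
    static List<String> rewrite(List<List<String>> pages, Set<Long> deletedPositions)
    {
        List<String> survivors = new ArrayList<>();
        long position = 0; // running 0-based row position across ALL pages of the file
        for (List<String> page : pages) {
            for (String row : page) {
                if (!deletedPositions.contains(position)) {
                    survivors.add(row);
                }
                position++;
            }
        }
        return survivors;
    }

    public static void main(String[] args)
    {
        // Positions 0..4 span two pages; delete positions 1 ("b") and 3 ("d")
        List<List<String>> pages = List.of(List.of("a", "b", "c"), List.of("d", "e"));
        Set<Long> deletes = new TreeSet<>(List.of(1L, 3L));
        System.out.println(rewrite(pages, deletes)); // prints [a, c, e]
    }
}
```

The key invariant is that the position counter is never reset at page boundaries, since Iceberg position deletes address rows by their offset within the whole data file.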
nookcreed pushed a commit to nookcreed/trino that referenced this pull request Jan 22, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request Jan 27, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request Mar 12, 2026
nookcreed pushed a commit to nookcreed/trino that referenced this pull request Mar 14, 2026
This PR adds copy-on-write (CoW) support to the Iceberg connector.

For the user interface, we follow the Iceberg specification https://iceberg.apache.org/docs/latest/configuration/#write-properties and add three table properties, `write_delete_mode`, `write_update_mode`, and `write_merge_mode`, to control the write behavior of each table change operation. Note that this is consistent with other popular engines such as Spark and Dremio: https://docs.dremio.com/25.x/sonar/query-manage/data-formats/apache-iceberg/table-properties/.

Since all three operations (delete, update, merge) use the same `MergeWriterOperator` during a table write, the engine cannot tell from inside `MergeWriterOperator` or `IcebergMetadata.getMergeRowIdColumnHandle` which operation is changing the table, and so cannot decide whether to use CoW or MoR to perform the change. This PR adds an optional field called `UpdateKind` to `IcebergTableHandle` to record the operation being performed on the table. This field is populated by `metadata.getTableHandle` through a new `UpdateKind` parameter, which is set in `StatementAnalyzer`.

As for the CoW implementation, we chose to build on the current MoR framework (`MergeProcessorOperator` + `MergeWriterOperator`), similar to the CoW implementation in the Delta Lake connector. The general idea is to rewrite the changed data files based on the current data file contents and the delta information from the current change operation. The data file path is already embedded in the pages for the MoR implementation, so we can read the current data file contents from that path; the delta calculation is also already available from the MoR implementation. One caveat is that a data file may have several associated delete files, which must be applied when computing the new version of the data file but are not embedded in the pages in the current MoR implementation, so we also embed delete file information for the CoW implementation.
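To make the property-driven routing concrete, here is a hypothetical sketch of choosing CoW vs MoR per operation kind. The enum names and property keys mirror this PR's description, but the routing logic itself is illustrative only, not the connector's actual code:

```java
import java.util.Map;

public class WriteModeRouting
{
    enum UpdateKind { DELETE, UPDATE, MERGE }
    enum UpdateMode { MERGE_ON_READ, COPY_ON_WRITE }

    // Pick the write mode for an operation from the table properties,
    // defaulting to merge-on-read for backward compatibility.
    static UpdateMode modeFor(UpdateKind kind, Map<String, String> tableProperties)
    {
        String property = switch (kind) {
            case DELETE -> "write_delete_mode";
            case UPDATE -> "write_update_mode";
            case MERGE -> "write_merge_mode";
        };
        String value = tableProperties.getOrDefault(property, "merge-on-read");
        return value.equals("copy-on-write") ? UpdateMode.COPY_ON_WRITE : UpdateMode.MERGE_ON_READ;
    }

    public static void main(String[] args)
    {
        Map<String, String> props = Map.of("write_delete_mode", "copy-on-write");
        System.out.println(modeFor(UpdateKind.DELETE, props)); // COPY_ON_WRITE
        System.out.println(modeFor(UpdateKind.UPDATE, props)); // MERGE_ON_READ (default)
    }
}
```

The point of recording `UpdateKind` in the table handle is exactly to make this per-operation decision possible at plan time, since the merge writer itself cannot distinguish the three statements.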
User-facing docs will be updated once the reviewers are satisfied with the implementation.
Description
Fix #17272
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: