[presto][iceberg] Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3#27396
Closed
apurva-meta wants to merge 6 commits intoprestodb:masterfrom
Closed
[presto][iceberg] Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3#27396apurva-meta wants to merge 6 commits intoprestodb:masterfrom
apurva-meta wants to merge 6 commits intoprestodb:masterfrom
Conversation
Contributor
Reviewer's GuideImplements Iceberg V3 nanosecond timestamp (TIMESTAMP_NANO) support and fully wires Iceberg V3 deletion vectors (PUFFIN format) and row-level operations (DELETE/UPDATE/MERGE) through Presto’s Java and C++/Velox stacks, including metadata/protocol changes, routing, and comprehensive integration tests plus a DV compaction procedure. Sequence diagram for Iceberg V3 deletion vector write pathsequenceDiagram
actor User
participant Coordinator
participant Worker
participant IcebergPageSourceProvider
participant IcebergUpdateablePageSource
participant IcebergDeletionVectorPageSink
participant HdfsEnvironment
participant IcebergAbstractMetadata
participant IcebergTable
User->>Coordinator: submit DELETE statement
Coordinator->>Worker: schedule task with IcebergSplit
Worker->>IcebergPageSourceProvider: createPageSource(split)
IcebergPageSourceProvider->>IcebergPageSourceProvider: read storageProperties
IcebergPageSourceProvider-->>IcebergPageSourceProvider: tableFormatVersion >= 3
IcebergPageSourceProvider->>IcebergUpdateablePageSource: construct with deleteSinkSupplier returning IcebergDeletionVectorPageSink
loop for each deleted row
IcebergUpdateablePageSource->>IcebergDeletionVectorPageSink: appendPage(page_of_row_positions)
IcebergDeletionVectorPageSink->>IcebergDeletionVectorPageSink: collect row_positions (int)
end
Worker->>IcebergUpdateablePageSource: finish()
IcebergUpdateablePageSource->>IcebergDeletionVectorPageSink: finish()
IcebergDeletionVectorPageSink->>IcebergDeletionVectorPageSink: sort positions and serialize roaring bitmap
IcebergDeletionVectorPageSink->>HdfsEnvironment: write Puffin file with Blob deletion_vector_v2
IcebergDeletionVectorPageSink-->>IcebergUpdateablePageSource: CommitTaskData (FileFormat.PUFFIN, POSITION_DELETES, contentOffset, contentSizeInBytes, recordCount)
IcebergUpdateablePageSource-->>Coordinator: commit tasks (including DV CommitTaskData)
Coordinator->>IcebergAbstractMetadata: finishDeleteWithOutput(tasks)
IcebergAbstractMetadata->>IcebergTable: newRewrite().rewriteFiles(remove_old_DVs, add_PUFFIN_DVs)
IcebergTable-->>User: DELETE committed with V3 deletion vectors
Class diagram for updated Iceberg delete/DV metadata typesclassDiagram
class CommitTaskData {
-String path
-long fileSizeInBytes
-MetricsWrapper metrics
-int partitionSpecId
-Optional~String~ partitionDataJson
-FileFormat fileFormat
-Optional~String~ referencedDataFile
-FileContent content
-OptionalLong contentOffset
-OptionalLong contentSizeInBytes
-OptionalLong recordCount
+CommitTaskData(path, fileSizeInBytes, metrics, partitionSpecId, partitionDataJson, fileFormat, referencedDataFile, content, contentOffset, contentSizeInBytes, recordCount)
+CommitTaskData(path, fileSizeInBytes, metrics, partitionSpecId, partitionDataJson, fileFormat, referencedDataFile, content)
+getPath() String
+getFileSizeInBytes() long
+getMetrics() MetricsWrapper
+getPartitionSpecId() int
+getPartitionDataJson() Optional~String~
+getFileFormat() FileFormat
+getReferencedDataFile() Optional~String~
+getContent() FileContent
+getContentOffset() OptionalLong
+getContentSizeInBytes() OptionalLong
+getRecordCount() OptionalLong
}
class DeleteFileJava {
<<final>>
-FileContent content
-String path
-FileFormat fileFormat
-long recordCount
-long fileSizeInBytes
-List~Integer~ equalityFieldIds
-Map~Integer,byte[]~ lowerBounds
-Map~Integer,byte[]~ upperBounds
-long dataSequenceNumber
+fromIceberg(deleteFile) DeleteFileJava
+DeleteFileJava(content, path, fileFormat, recordCount, fileSizeInBytes, equalityFieldIds, lowerBounds, upperBounds, dataSequenceNumber)
+getContent() FileContent
+getPath() String
+getFileFormat() FileFormat
+getRecordCount() long
+getFileSizeInBytes() long
+getEqualityFieldIds() List~Integer~
+getLowerBounds() Map~Integer,byte[]~
+getUpperBounds() Map~Integer,byte[]~
+getDataSequenceNumber() long
+toString() String
}
class DeleteFileCpp {
<<struct>>
+FileContent content
+String path
+FileFormat format
+int64_t recordCount
+int64_t fileSizeInBytes
+List~Integer~ equalityFieldIds
+Map~Integer,String~ lowerBounds
+Map~Integer,String~ upperBounds
+int64_t dataSequenceNumber
+to_json(j, DeleteFileCpp)
+from_json(j, DeleteFileCpp)
}
class FileFormatJava {
<<enum>>
ORC
PARQUET
AVRO
METADATA
PUFFIN
-String ext
-boolean splittable
+fromIcebergFileFormat(format) FileFormatJava
}
class FileFormatCpp {
<<enum>>
ORC
PARQUET
AVRO
METADATA
PUFFIN
}
class FileContentJava {
<<enum>>
DATA
POSITION_DELETES
EQUALITY_DELETES
}
class FileContentCpp {
<<enum>>
DATA
POSITION_DELETES
EQUALITY_DELETES
}
CommitTaskData --> FileFormatJava : uses
CommitTaskData --> FileContentJava : uses
CommitTaskData --> DeleteFileJava : serialized as
DeleteFileJava --> FileFormatJava : uses
DeleteFileJava --> FileContentJava : uses
DeleteFileCpp --> FileFormatCpp : uses
DeleteFileCpp --> FileContentCpp : uses
FileFormatJava <--> FileFormatCpp : protocol_mapping
FileContentJava <--> FileContentCpp : protocol_mapping
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- There are multiple independent implementations of Roaring bitmap (de)serialization for deletion vectors (e.g., in tests,
IcebergDeletionVectorPageSink, andRewriteDeleteFilesProcedure), which increases the risk of subtle bugs and divergence; consider centralizing this logic in a single utility and/or delegating to a well-tested Roaring library wherever possible. - In
RewriteDeleteFilesProcedure.rewriteDeleteFiles, the commit flow is a bit obscure (e.g., calling bothrewriteFiles.commit()andmetadata.commit()whilemetadatais created ad hoc via the factory); it would help maintainability to clarify or refactor ownership of commits so it’s clear which component is responsible for finalizing Iceberg changes. - The Roaring bitmap
deserializeRoaringBitmapinRewriteDeleteFilesProcedureimplements support for both array and bitmap containers and partially handles run containers via cookie bits; it might be safer to explicitly validate the cookie and container types you expect (and fail fast otherwise) to avoid silently ignoring or misinterpreting newer/unsupported encodings.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There are multiple independent implementations of Roaring bitmap (de)serialization for deletion vectors (e.g., in tests, `IcebergDeletionVectorPageSink`, and `RewriteDeleteFilesProcedure`), which increases the risk of subtle bugs and divergence; consider centralizing this logic in a single utility and/or delegating to a well-tested Roaring library wherever possible.
- In `RewriteDeleteFilesProcedure.rewriteDeleteFiles`, the commit flow is a bit obscure (e.g., calling both `rewriteFiles.commit()` and `metadata.commit()` while `metadata` is created ad hoc via the factory); it would help maintainability to clarify or refactor ownership of commits so it’s clear which component is responsible for finalizing Iceberg changes.
- The Roaring bitmap `deserializeRoaringBitmap` in `RewriteDeleteFilesProcedure` implements support for both array and bitmap containers and partially handles run containers via cookie bits; it might be safer to explicitly validate the cookie and container types you expect (and fail fast otherwise) to avoid silently ignoring or misinterpreting newer/unsupported encodings.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 21, 2026
…e support for Iceberg V3 (prestodb#27396) Summary: Iceberg V3 introduces nanosecond-precision timestamp types (timestamp_ns and timestamptz_ns). This diff adds support for reading tables with these column types by mapping them to Presto's best available precision (TIMESTAMP_MICROSECONDS for timestamp_ns, TIMESTAMP_WITH_TIME_ZONE for timestamptz_ns). Changes: - TypeConverter: Map TIMESTAMP_NANO to Presto types and ORC types - ExpressionConverter: Fix predicate pushdown for TIMESTAMP_MICROSECONDS precision (was incorrectly converting microseconds as milliseconds) - IcebergUtil: Handle TIMESTAMP_NANO partition values (nanos → micros) - PartitionData: Handle TIMESTAMP_NANO in JSON partition deserialization - PartitionTable: Convert nanosecond partition values to microseconds - TestIcebergV3: Add testNanosecondTimestampSchema integration test == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531552
f280a08 to
b7ba367
Compare
This was referenced Mar 23, 2026
added 5 commits
March 24, 2026 16:35
…tensibility Summary: - Reformat FileContent enum in presto_protocol_iceberg.h from single-line to multi-line for better readability and future extension. - Add blank line for visual separation before infoColumns initialization. Protocol files are auto-generated from Java sources via chevron. The manual edits here mirror what the generator would produce once the Java changes are landed and the protocol is regenerated. Differential Revision: D97531548
…equality delete conflict resolution Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity Differential Revision: D97531547
…discovery Summary: Iceberg V3 introduces deletion vectors stored as blobs inside Puffin files. Previously, the coordinator's IcebergSplitSource rejected PUFFIN-format delete files with a NOT_SUPPORTED error, preventing V3 deletion vectors from being discovered and sent to workers. This diff: 1. Adds PUFFIN to the FileFormat enum (both presto-trunk and presto-facebook-trunk) so fromIcebergFileFormat() can convert Iceberg's PUFFIN format to Presto's FileFormat.PUFFIN. 2. Removes the PUFFIN rejection check in presto-trunk's IcebergSplitSource.toIcebergSplit(), allowing deletion vector files to flow through to workers. 3. Updates TestIcebergV3 to verify PUFFIN files are accepted rather than rejected at split enumeration time. The C++ worker-side changes (protocol enum + connector conversion) will follow in a separate diff. Differential Revision: D97531557
…nd connector layer
Summary:
This is the C++ counterpart to the Java PUFFIN support diff. It wires
the PUFFIN file format through the Prestissimo protocol and connector
conversion layer so that Iceberg V3 deletion vector files can be
deserialized and handled by native workers.
Changes:
1. Adds PUFFIN to the C++ protocol FileFormat enum and its JSON
serialization table in presto_protocol_iceberg.{h,cpp}.
2. Handles PUFFIN in toVeloxFileFormat() in
IcebergPrestoToVeloxConnector.cpp, mapping it to DWRF as a
placeholder since DeletionVectorReader reads raw binary and
does not use the DWRF/Parquet reader infrastructure.
Differential Revision: D97531555
…age sink and compaction procedure Summary: - Add IcebergDeletionVectorPageSink for writing DV files during table maintenance - Add RewriteDeleteFilesProcedure for DV compaction - Wire DV page sink through IcebergCommonModule, IcebergAbstractMetadata, IcebergPageSourceProvider - Add IcebergUpdateablePageSource for DV-aware page source - Update CommitTaskData, IcebergUtil for DV support - Add test coverage in TestIcebergV3 Differential Revision: D97531549
b7ba367 to
e66368e
Compare
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 27, 2026
…e support for Iceberg V3 (prestodb#27396) Summary: Iceberg V3 introduces nanosecond-precision timestamp types (timestamp_ns and timestamptz_ns). This diff adds support for reading tables with these column types by mapping them to Presto's best available precision (TIMESTAMP_MICROSECONDS for timestamp_ns, TIMESTAMP_WITH_TIME_ZONE for timestamptz_ns). Changes: - TypeConverter: Map TIMESTAMP_NANO to Presto types and ORC types - ExpressionConverter: Fix predicate pushdown for TIMESTAMP_MICROSECONDS precision (was incorrectly converting microseconds as milliseconds) - IcebergUtil: Handle TIMESTAMP_NANO partition values (nanos → micros) - PartitionData: Handle TIMESTAMP_NANO in JSON partition deserialization - PartitionTable: Convert nanosecond partition values to microseconds - TestIcebergV3: Add testNanosecondTimestampSchema integration test == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531552
…ort for Iceberg V3 Summary: Iceberg V3 introduces nanosecond-precision timestamp types (timestamp_ns and timestamptz_ns). This diff adds support for reading tables with these column types by mapping them to Presto's best available precision (TIMESTAMP_MICROSECONDS for timestamp_ns, TIMESTAMP_WITH_TIME_ZONE for timestamptz_ns). Changes: - TypeConverter: Map TIMESTAMP_NANO to Presto types and ORC types - ExpressionConverter: Fix predicate pushdown for TIMESTAMP_MICROSECONDS precision (was incorrectly converting microseconds as milliseconds) - IcebergUtil: Handle TIMESTAMP_NANO partition values (nanos → micros) - PartitionData: Handle TIMESTAMP_NANO in JSON partition deserialization - PartitionTable: Convert nanosecond partition values to microseconds - TestIcebergV3: Add testNanosecondTimestampSchema integration test Differential Revision: D97531552
e66368e to
6eec460
Compare
|
Contributor
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Iceberg V3 introduces nanosecond-precision timestamp types (timestamp_ns
and timestamptz_ns). This diff adds support for reading tables with these
column types by mapping them to Presto's best available precision
(TIMESTAMP_MICROSECONDS for timestamp_ns, TIMESTAMP_WITH_TIME_ZONE for
timestamptz_ns).
Changes:
precision (was incorrectly converting microseconds as milliseconds)
Differential Revision: D97531552