[prestissimo][iceberg] Wire PUFFIN file format through C++ protocol and connector layer #27394

Closed
apurva-meta wants to merge 4 commits into prestodb:master from apurva-meta:export-D97531555

@apurva-meta apurva-meta commented Mar 21, 2026

Summary:
This is the C++ counterpart to the Java PUFFIN support diff. It wires
the PUFFIN file format through the Prestissimo protocol and connector
conversion layer so that Iceberg V3 deletion vector files can be
deserialized and handled by native workers.

Changes:

  1. Adds PUFFIN to the C++ protocol FileFormat enum and its JSON
    serialization table in presto_protocol_iceberg.{h,cpp}.
  2. Handles PUFFIN in toVeloxFileFormat() in
    IcebergPrestoToVeloxConnector.cpp, mapping it to DWRF as a
    placeholder since DeletionVectorReader reads raw binary and
    does not use the DWRF/Parquet reader infrastructure.
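
A minimal sketch of the two changes above, assuming C++17 and simplified stand-in enums rather than the real presto_protocol_iceberg or Velox types. The names ProtocolFileFormat, VeloxFileFormat, kFileFormatNames, toString and fromString are hypothetical; only toVeloxFileFormat() and the PUFFIN to DWRF placeholder mapping come from the description. The handling of ORC, PARQUET, AVRO and METADATA is an assumption, since the description does not state it.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string_view>
#include <utility>

// Hypothetical stand-ins for the protocol-side and Velox-side enums.
enum class ProtocolFileFormat { ORC, PARQUET, AVRO, METADATA, PUFFIN };
enum class VeloxFileFormat { ORC, PARQUET, DWRF };

// One name table drives both directions of the string conversion, so adding
// PUFFIN means adding one enum value and one table row.
constexpr std::array<std::pair<ProtocolFileFormat, std::string_view>, 5>
    kFileFormatNames{{
        {ProtocolFileFormat::ORC, "ORC"},
        {ProtocolFileFormat::PARQUET, "PARQUET"},
        {ProtocolFileFormat::AVRO, "AVRO"},
        {ProtocolFileFormat::METADATA, "METADATA"},
        {ProtocolFileFormat::PUFFIN, "PUFFIN"},
    }};

// Enum to wire name; throws if the value is missing from the table.
std::string_view toString(ProtocolFileFormat format) {
  for (const auto& [value, name] : kFileFormatNames) {
    if (value == format) {
      return name;
    }
  }
  throw std::invalid_argument("unknown file format value");
}

// Wire name to enum; returns std::nullopt for names not in the table.
std::optional<ProtocolFileFormat> fromString(std::string_view name) {
  for (const auto& [value, entryName] : kFileFormatNames) {
    if (entryName == name) {
      return value;
    }
  }
  return std::nullopt;
}

// PUFFIN maps to DWRF purely as a placeholder: per the description, the
// deletion vector reader reads raw binary and never consults this value.
// Rejecting AVRO and METADATA here is an assumption of this sketch.
VeloxFileFormat toVeloxFileFormat(ProtocolFileFormat format) {
  switch (format) {
    case ProtocolFileFormat::ORC:
      return VeloxFileFormat::ORC;
    case ProtocolFileFormat::PARQUET:
      return VeloxFileFormat::PARQUET;
    case ProtocolFileFormat::PUFFIN:
      return VeloxFileFormat::DWRF;
    default:
      throw std::invalid_argument("unsupported file format");
  }
}
```

With this shape, a worker that receives the string "PUFFIN" can deserialize it and convert it without hitting the unsupported-format path.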

Differential Revision: D97531555

sourcery-ai bot commented Mar 21, 2026

Reviewer's Guide

Wires the Iceberg PUFFIN file format and delete-file data sequence numbers through the Java and C++ Presto/Iceberg protocol and connector layers so native workers can correctly route V3 deletion vector files to the DeletionVectorReader while accepting PUFFIN deletes during split enumeration.

Class diagram for updated Iceberg delete file and file format types

classDiagram
    direction LR

    class JavaFileFormat {
      <<enum>>
      ORC
      PARQUET
      AVRO
      METADATA
      PUFFIN
      +fromIcebergFileFormat(format)
    }

    class JavaDeleteFile {
      <<final>>
      -FileContent content
      -String path
      -FileFormat format
      -long recordCount
      -long fileSizeInBytes
      -List~Integer~ equalityFieldIds
      -Map~Integer, byte[]~ lowerBounds
      -Map~Integer, byte[]~ upperBounds
      -long dataSequenceNumber
      +fromIceberg(deleteFile)
      +getContent() FileContent
      +getPath() String
      +getFormat() FileFormat
      +getRecordCount() long
      +getFileSizeInBytes() long
      +getEqualityFieldIds() List~Integer~
      +getLowerBounds() Map~Integer, byte[]~
      +getUpperBounds() Map~Integer, byte[]~
      +getDataSequenceNumber() long
    }

    class CppFileFormat {
      <<enum class>>
      ORC
      PARQUET
      AVRO
      METADATA
      PUFFIN
    }

    class CppFileContent {
      <<enum class>>
      DATA
      POSITION_DELETES
      EQUALITY_DELETES
    }

    class CppDeleteFile {
      <<struct>>
      +FileContent content
      +String path
      +FileFormat format
      +int64_t recordCount
      +int64_t fileSizeInBytes
      +List~Integer~ equalityFieldIds
      +Map~Integer, String~ lowerBounds
      +Map~Integer, String~ upperBounds
      +int64_t dataSequenceNumber
      +to_json(j, p)
      +from_json(j, p)
    }

    class VeloxFileFormat {
      <<enum>>
      ORC
      PARQUET
      DWRF
    }

    class VeloxFileContent {
      <<enum>>
      kData
      kPositionalDeletes
      kEqualityDeletes
      kDeletionVector
    }

    class IcebergDeleteFileVelox {
      <<class>>
      +FileContent content
      +string path
      +FileFormat format
      +int64_t recordCount
      +int64_t fileSizeInBytes
      +vector~int32_t~ equalityFieldIds
      +unordered_map~int32_t, string~ lowerBounds
      +unordered_map~int32_t, string~ upperBounds
      +int64_t dataSequenceNumber
    }

    JavaDeleteFile --> JavaFileFormat : uses
    CppDeleteFile --> CppFileFormat : uses
    CppDeleteFile --> CppFileContent : uses
    IcebergDeleteFileVelox --> VeloxFileFormat : uses
    IcebergDeleteFileVelox --> VeloxFileContent : uses

    CppFileFormat <--> JavaFileFormat : protocol_mapping
    CppDeleteFile <--> JavaDeleteFile : JSON_protocol

    class IcebergPrestoToVeloxConnector {
      <<class>>
      +toVeloxFileContent(content) VeloxFileContent
      +toVeloxFileFormat(format) VeloxFileFormat
      +toVeloxSplit(catalogId, connectorSplit, splitContext) unique_ptr~HiveIcebergSplit~
    }

    IcebergPrestoToVeloxConnector --> CppDeleteFile : reads
    IcebergPrestoToVeloxConnector --> IcebergDeleteFileVelox : constructs
    IcebergPrestoToVeloxConnector ..> VeloxFileFormat : maps_PUFFIN_to_DWRF
    IcebergPrestoToVeloxConnector ..> VeloxFileContent : reclassifies_PUFFIN_DV_to_kDeletionVector

File-Level Changes

Change: Propagate PUFFIN file format and delete-file dataSequenceNumber through the Presto Iceberg protocol (Java and C++).
Details:
  • Extend Java Iceberg FileFormat enum to support PUFFIN and map from Iceberg’s FileFormat.PUFFIN
  • Add dataSequenceNumber field to the Java DeleteFile model, populate it from Iceberg DeleteFile, expose via JSON, and include it in toString()
  • Extend C++ Iceberg FileFormat enum with PUFFIN and update its JSON serialization/deserialization table
  • Extend C++ DeleteFile struct and its JSON (de)serialization to include dataSequenceNumber
Files:
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/FileFormat.java
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/delete/DeleteFile.java
  • presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.h
  • presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp

Change: Adjust native Iceberg connector split conversion to correctly classify PUFFIN deletion vectors and pass sequence numbers into Velox.
Details:
  • Update toVeloxFileContent() to handle EQUALITY_DELETES and map them to Velox kEqualityDeletes
  • Update toVeloxFileFormat() to accept PUFFIN, mapping it to DWRF as an unused placeholder format for DeletionVectorReader
  • Reclassify delete files that are POSITION_DELETES+PUFFIN into kDeletionVector before constructing Velox IcebergDeleteFile objects
  • Pass deleteFile.dataSequenceNumber into Velox IcebergDeleteFile and propagate icebergSplit->dataSequenceNumber into HiveIcebergSplit plus infoColumns
  • Refactor local variables in toVeloxSplit() to use const where appropriate and avoid repeated unchecked pointer access
Files:
  • presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp

Change: Allow PUFFIN deletion vectors in the Java Iceberg split source and update tests to assert acceptance instead of rejection.
Details:
  • Remove the NOT_SUPPORTED guard that rejected PUFFIN deletion vector delete files during split enumeration
  • Rename and rewrite the Iceberg V3 test to verify that PUFFIN deletion vectors are accepted by the split source and that failures no longer reference the old unsupported error message
  • Add an assertion helper import used by the updated test
Files:
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSplitSource.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergV3.java
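
The reclassification step listed for the native connector can be sketched roughly as follows. This is an illustration under assumptions, not the code in IcebergPrestoToVeloxConnector.cpp: the enums are simplified stand-ins mirroring the class diagram, and classifyDeleteFile is a hypothetical helper name.

```cpp
#include <cassert>
#include <stdexcept>

// Simplified stand-ins mirroring the enums in the class diagram.
enum class ProtocolFileContent { DATA, POSITION_DELETES, EQUALITY_DELETES };
enum class ProtocolFileFormat { ORC, PARQUET, AVRO, METADATA, PUFFIN };
enum class VeloxFileContent {
  kData,
  kPositionalDeletes,
  kEqualityDeletes,
  kDeletionVector
};

// Plain one-to-one content mapping, including EQUALITY_DELETES.
VeloxFileContent toVeloxFileContent(ProtocolFileContent content) {
  switch (content) {
    case ProtocolFileContent::DATA:
      return VeloxFileContent::kData;
    case ProtocolFileContent::POSITION_DELETES:
      return VeloxFileContent::kPositionalDeletes;
    case ProtocolFileContent::EQUALITY_DELETES:
      return VeloxFileContent::kEqualityDeletes;
  }
  throw std::logic_error("unhandled file content");
}

// Hypothetical helper: the protocol has no dedicated deletion-vector content
// type, so a positional delete stored in a PUFFIN file is treated as one.
VeloxFileContent classifyDeleteFile(
    ProtocolFileContent content,
    ProtocolFileFormat format) {
  if (content == ProtocolFileContent::POSITION_DELETES &&
      format == ProtocolFileFormat::PUFFIN) {
    return VeloxFileContent::kDeletionVector;
  }
  return toVeloxFileContent(content);
}
```

Keeping the plain mapping separate from the reclassification leaves toVeloxFileContent() a pure enum translation, with the format-dependent rule in one place.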

Possibly linked issues

  • #native(Iceberg): The PR implements Iceberg V3 Puffin-based deletion vectors and related delete semantics, directly addressing the requested DV support.

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In toVeloxFileFormat(), consider explicitly guarding the PUFFIN→DWRF mapping with a check on file content (e.g., only for deletion-vector content) or an assertion, so that PUFFIN cannot silently be misrouted if introduced for non-DV files in the future.
  • In IcebergPrestoToVeloxConnector::toVeloxSplit, you can avoid the NOLINT(facebook-bugprone-unchecked-pointer-access) by binding *icebergSplit to a reference after VELOX_CHECK_NOT_NULL and then using that reference for dataSequenceNumber and other fields instead of the raw pointer.
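
Both suggestions could look roughly like the sketch below. It assumes simplified stand-in types: the IcebergSplit struct, the two-argument toVeloxFileFormat overload and readDataSequenceNumber are hypothetical, and a plain exception stands in for VELOX_CHECK_NOT_NULL.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>

// Simplified stand-ins for the protocol and Velox types.
enum class ProtocolFileFormat { ORC, PARQUET, AVRO, METADATA, PUFFIN };
enum class VeloxFileFormat { ORC, PARQUET, DWRF };
enum class VeloxFileContent {
  kData,
  kPositionalDeletes,
  kEqualityDeletes,
  kDeletionVector
};

// Suggestion 1: accept PUFFIN only for deletion-vector content, so a PUFFIN
// file with any other content fails loudly instead of being read as DWRF.
VeloxFileFormat toVeloxFileFormat(
    ProtocolFileFormat format,
    VeloxFileContent content) {
  switch (format) {
    case ProtocolFileFormat::ORC:
      return VeloxFileFormat::ORC;
    case ProtocolFileFormat::PARQUET:
      return VeloxFileFormat::PARQUET;
    case ProtocolFileFormat::PUFFIN:
      if (content != VeloxFileContent::kDeletionVector) {
        throw std::invalid_argument(
            "PUFFIN is only expected for deletion vectors");
      }
      return VeloxFileFormat::DWRF; // Placeholder, unused by the DV reader.
    default:
      throw std::invalid_argument("unsupported file format");
  }
}

// Hypothetical split type carrying only the field discussed here.
struct IcebergSplit {
  int64_t dataSequenceNumber = 0;
};

// Suggestion 2: check the pointer once, bind a reference, and use the
// reference afterwards so no later access goes through the raw pointer.
int64_t readDataSequenceNumber(
    const std::shared_ptr<const IcebergSplit>& icebergSplit) {
  if (icebergSplit == nullptr) { // Stand-in for VELOX_CHECK_NOT_NULL.
    throw std::invalid_argument("icebergSplit must not be null");
  }
  const IcebergSplit& split = *icebergSplit;
  return split.dataSequenceNumber;
}
```

The guard assumes the content has already been reclassified to kDeletionVector before the format is converted; if the order is the other way around, the check would need to look at the protocol-side content and format instead.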
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `toVeloxFileFormat()`, consider explicitly guarding the PUFFIN→DWRF mapping with a check on file content (e.g., only for deletion-vector content) or an assertion, so that PUFFIN cannot silently be misrouted if introduced for non-DV files in the future.
- In `IcebergPrestoToVeloxConnector::toVeloxSplit`, you can avoid the `NOLINT(facebook-bugprone-unchecked-pointer-access)` by binding `*icebergSplit` to a reference after `VELOX_CHECK_NOT_NULL` and then using that reference for `dataSequenceNumber` and other fields instead of the raw pointer.

## Individual Comments

### Comment 1
<location path="presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergV3.java" line_range="317-326" />
<code_context>
+            try {
+                computeActual("SELECT * FROM " + tableName);
+            }
+            catch (RuntimeException e) {
+                // Verify the error is NOT the old "PUFFIN not supported" rejection.
+                // Other failures (e.g., fake .puffin file not on disk) are acceptable.
+                assertFalse(
+                        e.getMessage().contains("Iceberg deletion vectors") && e.getMessage().contains("not supported"),
+                        "PUFFIN deletion vectors should be accepted, not rejected: " + e.getMessage());
</code_context>
<issue_to_address>
**suggestion:** Defensive handling of a potential null exception message to avoid NPEs in the test itself.

This assertion calls `e.getMessage().contains(...)` twice; if the exception message is null, the test will throw `NullPointerException` instead of cleanly asserting on PUFFIN support. You can defensively handle this by normalizing the message first, e.g.

```java
String message = String.valueOf(e.getMessage());
assertFalse(
        message.contains("Iceberg deletion vectors") && message.contains("not supported"),
        "PUFFIN deletion vectors should be accepted, not rejected: " + message);
```

so the test remains stable even when the exception message is null.

```suggestion
            try {
                computeActual("SELECT * FROM " + tableName);
            }
            catch (RuntimeException e) {
                // Verify the error is NOT the old "PUFFIN not supported" rejection.
                // Other failures (e.g., fake .puffin file not on disk) are acceptable.
                String message = String.valueOf(e.getMessage());
                assertFalse(
                        message.contains("Iceberg deletion vectors") && message.contains("not supported"),
                        "PUFFIN deletion vectors should be accepted, not rejected: " + message);
            }
```
</issue_to_address>

apurva-meta added a commit to apurva-meta/presto that referenced this pull request Mar 21, 2026
…ocol and connector layer (prestodb#27394)

Summary:

This is the C++ counterpart to the Java PUFFIN support diff. It wires
the PUFFIN file format through the Prestissimo protocol and connector
conversion layer so that Iceberg V3 deletion vector files can be
deserialized and handled by native workers.

Changes:
1. Adds PUFFIN to the C++ protocol FileFormat enum and its JSON
   serialization table in presto_protocol_iceberg.{h,cpp}.
2. Handles PUFFIN in toVeloxFileFormat() in
   IcebergPrestoToVeloxConnector.cpp, mapping it to DWRF as a
   placeholder since DeletionVectorReader reads raw binary and
   does not use the DWRF/Parquet reader infrastructure.

== RELEASE NOTES ==
General Changes
* Upgrade Apache Iceberg library from 1.10.0 to 1.10.1.
Hive Connector Changes
* Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring bitmaps, including a DV reader, writer, page sink, and compaction procedure.
* Add Iceberg equality delete file reader with sequence number conflict resolution per the Iceberg V2+ spec: equality deletes skip when deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips.
* Wire dataSequenceNumber through the Presto protocol layer (Java → C++) to enable server-side sequence number conflict resolution for all delete file types.
* Add PUFFIN file format support for deletion vector discovery, enabling the coordinator to locate DV files during split creation.
* Add Iceberg V3 deletion vector write path with DV page sink and rewrite_delete_files compaction procedure for DV maintenance.
* Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3 tables.
* Add Variant type support for Iceberg V3, enabling semi-structured data columns in Iceberg tables.
* Eagerly collect delete files during split creation with improved logging for easier debugging of Iceberg delete file resolution.
* Improve IcebergSplitReader error handling and fix test file handle leaks.
* Add end-to-end integration tests for Iceberg V3 covering snapshot lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE, MERGE, time-travel) and all 99 TPC-DS queries.

Differential Revision: D97531555
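
The sequence number rule quoted in the release notes can be written as a small predicate. This is a sketch of the rule as stated there, not the reader's actual code: DeleteKind and canSkipDeleteFile are hypothetical names, and treating a zero on either side as "never skip" is an assumption, since the note does not say which of the two numbers it means.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of delete files for this sketch.
enum class DeleteKind { kEquality, kPositional, kDeletionVector };

// Returns true when the delete file cannot affect the data file and can be
// skipped, following the comparison rules quoted in the release notes.
bool canSkipDeleteFile(
    DeleteKind kind,
    int64_t deleteFileSeqNum,
    int64_t dataFileSeqNum) {
  // Assumption: sequence number 0 marks legacy V1 metadata, so nothing can
  // be concluded from the comparison and the delete file is always applied.
  if (deleteFileSeqNum == 0 || dataFileSeqNum == 0) {
    return false;
  }
  if (kind == DeleteKind::kEquality) {
    // Equality deletes only apply to strictly older data files.
    return deleteFileSeqNum <= dataFileSeqNum;
  }
  // Positional deletes and deletion vectors also apply to data files that
  // carry the same sequence number.
  return deleteFileSeqNum < dataFileSeqNum;
}
```

For example, an equality delete and a data file that both carry sequence number 5 do not interact, while a positional delete with sequence number 5 still applies to that data file.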
@apurva-meta apurva-meta requested review from a team, ZacBlanco and hantangwangd as code owners March 21, 2026 06:49
@meta-codesync meta-codesync bot changed the title from "feat: [prestissimo][iceberg] Wire PUFFIN file format through C++ protocol and connector layer" to "feat: [prestissimo][iceberg] Wire PUFFIN file format through C++ protocol and connector layer (#27394)" Mar 21, 2026
Apurva Kumar added 3 commits March 24, 2026 16:35
…tensibility

Summary:
- Reformat FileContent enum in presto_protocol_iceberg.h from single-line
  to multi-line for better readability and future extension.
- Add blank line for visual separation before infoColumns initialization.

Protocol files are auto-generated from Java sources via chevron. The manual
edits here mirror what the generator would produce once the Java changes
are landed and the protocol is regenerated.

Differential Revision: D97531548
…equality delete conflict resolution

Summary:
Wire the dataSequenceNumber field from the Java Presto protocol to the
C++ Velox connector layer, enabling server-side sequence number conflict
resolution for equality delete files.

Changes:
- Add dataSequenceNumber field to IcebergSplit protocol (Java + C++)
- Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it
  through HiveIcebergSplit to IcebergSplitReader
- Add const qualifiers to local variables for code clarity

Differential Revision: D97531547
…discovery

Summary:
Iceberg V3 introduces deletion vectors stored as blobs inside Puffin files.
Previously, the coordinator's IcebergSplitSource rejected PUFFIN-format delete
files with a NOT_SUPPORTED error, preventing V3 deletion vectors from being
discovered and sent to workers.

This diff:
1. Adds PUFFIN to the FileFormat enum (both presto-trunk and
   presto-facebook-trunk) so fromIcebergFileFormat() can convert
   Iceberg's PUFFIN format to Presto's FileFormat.PUFFIN.
2. Removes the PUFFIN rejection check in presto-trunk's
   IcebergSplitSource.toIcebergSplit(), allowing deletion vector
   files to flow through to workers.
3. Updates TestIcebergV3 to verify PUFFIN files are accepted rather
   than rejected at split enumeration time.

The C++ worker-side changes (protocol enum + connector conversion) will
follow in a separate diff.

Differential Revision: D97531557
apurva-meta added a commit to apurva-meta/presto that referenced this pull request Mar 27, 2026
…ocol and connector layer (prestodb#27394)

…nd connector layer

Summary:
This is the C++ counterpart to the Java PUFFIN support diff. It wires
the PUFFIN file format through the Prestissimo protocol and connector
conversion layer so that Iceberg V3 deletion vector files can be
deserialized and handled by native workers.

Changes:
1. Adds PUFFIN to the C++ protocol FileFormat enum and its JSON
   serialization table in presto_protocol_iceberg.{h,cpp}.
2. Handles PUFFIN in toVeloxFileFormat() in
   IcebergPrestoToVeloxConnector.cpp, mapping it to DWRF as a
   placeholder since DeletionVectorReader reads raw binary and
   does not use the DWRF/Parquet reader infrastructure.

Differential Revision: D97531555
@meta-codesync meta-codesync bot changed the title from "feat: [prestissimo][iceberg] Wire PUFFIN file format through C++ protocol and connector layer (#27394)" to "[prestissimo][iceberg] Wire PUFFIN file format through C++ protocol and connector layer" Mar 27, 2026
@linux-foundation-easycla

CLA Missing ID CLA Not Signed

@steveburnett

  • Please sign the Presto CLA.

  • Please add a release note - or NO RELEASE NOTE - following the Release Notes Guidelines to pass the failing but not required CI check.

  • Please edit the PR title to follow semantic commit style to pass the failing and required CI check. See the failure in the test for advice.
