
Add DWRF file format support for Iceberg data sink#16875

Closed
apurva-meta wants to merge 7 commits into facebookincubator:main from apurva-meta:export-D97530138

Conversation

@apurva-meta

Summary:

  • Add DWRF file format support in IcebergDataSink for read and write paths
  • Update BUCK build targets for DWRF dependencies
  • Add IcebergDwrfInsertTest with comprehensive insert/read tests for DWRF format
  • Update CMakeLists.txt for new test targets

Differential Revision: D97530138

Summary:
- Add explicit equality deletes NYI branch in prepareSplit()
- Improve VELOX_NYI error messages with descriptive content type info
- Fix FILE handle leaks in IcebergReadTest by extracting getTestFileSize() helper
- Minor doc comment formatting improvements in IcebergSplitReader.h

Differential Revision: D97530140
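The FILE handle leak fix mentioned above can be illustrated with a small sketch. This is a hedged Python illustration of the pattern only (the actual helper is C++ test code; the name `get_test_file_size` mirrors the `getTestFileSize()` mentioned in the summary): the point is that the handle is closed on every path, including when an exception is raised.

```python
import os
import tempfile

def get_test_file_size(path: str) -> int:
    # Open the file, seek to the end to find its size, and always
    # close the handle -- the "with" block guarantees no leak even
    # if an exception is raised mid-read.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        return f.tell()

# Usage: create a small temp file and measure it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 128)
    name = tmp.name

size = get_test_file_size(name)
os.unlink(name)
```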
@netlify

netlify bot commented Mar 21, 2026

Deploy Preview for meta-velox canceled.

| Name | Link |
| -- | -- |
| 🔨 Latest commit | 7d9e3d8 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/meta-velox/deploys/69c61b624076a40008cc98ec |

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 21, 2026
@meta-codesync

meta-codesync bot commented Mar 21, 2026

@apurva-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97530138.

Collaborator

@PingLiuPing left a comment


Hi @apurva-meta, thanks for the code.

I noticed that several recently opened PRs seem to overlap. Could you please open a GitHub issue outlining your overall plan and how these changes fit together? If the work is still in progress, it would be great to mark the PRs as drafts for now.

@jinchengchenghh
Collaborator

Why did you add DWRF? Iceberg does not support DWRF; it only supports ORC. If you write files as DWRF, Apache Iceberg cannot read them.

@apurva-meta
Author

Velox stack:

| # | Diff | Title | GitHub PR |
| -- | -- | -- | -- |
| 1 | D97530140 | Improve IcebergSplitReader error handling and fix test file handle leaks | #16869 |
| 2 | D97530142 | Add Iceberg V3 deletion vector support (DV reader) | #16870 |
| 3 | D97530141 | Add Iceberg equality delete file reader | #16871 |
| 4 | D97530136 | Add sequence number conflict resolution for equality deletes | #16872 |
| 5 | D97530139 | Add sequence number conflict resolution for positional deletes and DVs | #16873 |
| 6 | D97530137 | Add Iceberg V3 deletion vector writer | #16874 |
| 7 | D97530138 | Add DWRF file format support for Iceberg data sink | #16875 |
| 8 | D97599411 | Add Manifold filesystem support with CAT token auth | No GitHub PR (fb-internal only) |

@apurva-meta
Author


Presto stack:

| # | Diff | Title | GitHub PR |
| -- | -- | -- | -- |
| 1 | D97531548 | Reformat FileContent enum | prestodb/presto#27391 |
| 2 | D97531547 | Wire dataSequenceNumber through protocol layer | prestodb/presto#27392 |
| 3 | D97531557 | Add PUFFIN format for DV discovery | prestodb/presto#27393 |
| 4 | D97531555 | Wire PUFFIN through C++ protocol | prestodb/presto#27394 |
| 5 | D97531549 | Add DV write path + compaction | prestodb/presto#27395 |
| 6 | D97531552 | Add TIMESTAMP_NANO support | prestodb/presto#27396 |
| 7 | D97531551 | Add Variant type support | prestodb/presto#27397 |
| 8 | D97531550 | Upgrade Iceberg 1.10.0 → 1.10.1 | prestodb/presto#27399 |
| 9 | D97531553 | Add e2e integration tests (TPC-DS) | prestodb/presto#27400 |
| 10 | D97531546 | Add DWRF file format support | prestodb/presto#27401 |
| 11 | D97599433 | Add commit_table_data CAS support | prestodb/presto#27414 |
| 12 | D97602693 | Enable V3 row-level operations | prestodb/presto#27415 |

@PingLiuPing, could you please review the PRs above? I will remove the duplicates (not listed here).

@apurva-meta
Author

> Why did you add DWRF? Iceberg does not support DWRF; it only supports ORC. If you write files as DWRF, Apache Iceberg cannot read them.

Good question. DWRF is Meta's internal fork of ORC with optimizations like FlatMap encoding and dictionary sharing. This support is additive and opt-in — PARQUET remains the default. DWRF is only used when explicitly configured as the storage format.

The use case is Meta's internal deployment where the entire read/write stack is Presto/Velox, which supports DWRF natively. Velox has first-class DWRF support — it includes a full DWRF reader (velox/dwio/dwrf/reader/) and writer (velox/dwio/dwrf/writer/) that are registered via registerDwrfReaderFactory() / registerDwrfWriterFactory(). DWRF is actually the most heavily used file format in Velox at Meta, powering the vast majority of warehouse workloads. Within that closed ecosystem, both the writer (IcebergDataSink) and reader (IcebergSplitReader) handle DWRF transparently via these registered factories.

The Iceberg metadata correctly records "fileFormat": "DWRF" in commit messages, so the reader always knows the actual file format. For OSS users, the default PARQUET path is unchanged — this diff doesn't affect that at all.

You're right that standard Apache Iceberg tools (Spark, Trino, Flink) cannot read DWRF files. This is intentional — DWRF support is for environments where the entire stack (writer + reader) supports it. It's the same pattern as how the ORC format enum already exists in Iceberg despite not all tools supporting all ORC features equally.

Apurva Kumar added 5 commits March 24, 2026 13:12
Summary:
Add deletion vector (DV) reader to the Velox Iceberg connector,
enabling Iceberg V3 spec support for row-level deletes.

Iceberg V3 replaces positional delete files with deletion vectors — compact
roaring bitmaps stored as blobs inside Puffin files. Compared to V2 positional
delete files, DVs are more compact and avoid sorted merge of multiple delete
files at read time.

Changes:
- IcebergDeleteFile.h: Add FileContent::kDeletionVector enum value
- DeletionVectorReader.h/cpp: New reader that loads a Puffin blob,
  deserializes the roaring bitmap portable format (array, bitset, and run
  containers), and sets bits in the deleteBitmap for the current batch range.
  No CRoaring dependency — self-contained parser.
- IcebergSplitReader.h: Add deletionVectorReaders_ member, include header
- IcebergSplitReader.cpp: Route kDeletionVector in prepareSplit(), apply
  DVs alongside positional deletes in next()
- BUCK: Add DeletionVectorReader library target

Differential Revision: D97530142
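The last step above, setting bits in the delete bitmap for the current batch range, can be sketched as follows. This is a hedged Python sketch of the logic only (function and variable names are hypothetical, not the actual Velox code); it assumes the Puffin blob has already been deserialized into a set of deleted file positions.

```python
def apply_deletion_vector(deleted_positions, batch_offset, batch_size):
    """Return a per-row delete mask (True = deleted) for the rows
    [batch_offset, batch_offset + batch_size) of the data file."""
    mask = [False] * batch_size
    for pos in deleted_positions:
        # Only positions that fall inside the current batch are marked;
        # positions outside the range are left for other batches.
        if batch_offset <= pos < batch_offset + batch_size:
            mask[pos - batch_offset] = True
    return mask

# Rows 3 and 5 of the file are deleted; read a batch of 4 rows starting at row 2.
mask = apply_deletion_vector({3, 5, 100}, batch_offset=2, batch_size=4)
```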
Summary:
Implements Iceberg equality delete support for the Velox Iceberg connector.

Equality delete files contain rows with values for one or more columns
(identified by equalityFieldIds). A base data row is deleted if its values
match ALL specified columns of ANY row in the delete file.

The implementation:
- Adds EqualityDeleteFileReader that eagerly reads the entire delete file
  and builds an in-memory hash multimap of delete key tuples during
  construction.
- Wires EqualityDeleteFileReader into IcebergSplitReader::prepareSplit()
  to resolve equalityFieldIds to column names/types from the table schema,
  and into IcebergSplitReader::next() to apply post-read equality delete
  filtering with row compaction.
- Handles lazy vectors from file readers via loadedVector() before
  accessing values for hashing and comparison.
- Supports BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE,
  VARCHAR, VARBINARY, and TIMESTAMP column types.

Differential Revision: D97530141
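The matching rule above, a base row is deleted if its values match ALL equality columns of ANY delete-file row, can be sketched in Python (names are hypothetical; the actual implementation builds an in-memory hash multimap in C++):

```python
def build_delete_keys(delete_rows, equality_fields):
    # Each delete-file row contributes one key tuple over the equality columns.
    return {tuple(row[f] for f in equality_fields) for row in delete_rows}

def filter_batch(rows, delete_keys, equality_fields):
    # Keep only rows whose key tuple is NOT in the delete set (row compaction).
    return [r for r in rows
            if tuple(r[f] for f in equality_fields) not in delete_keys]

deletes = build_delete_keys([{"id": 2, "name": "b"}],
                            equality_fields=["id", "name"])
kept = filter_batch(
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 2, "name": "c"}],
    deletes, ["id", "name"])
```

A row must match on every equality column to be dropped, so `{"id": 2, "name": "c"}` survives even though its `id` matches a delete row.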
Summary:
Implements Iceberg V2+ sequence number conflict resolution for equality
delete files. Per the Iceberg spec, an equality delete file should only
be applied to data files whose data sequence number is strictly less
than the delete file's data sequence number. This prevents concurrent
write conflicts where a delete file written concurrently with a data
file could incorrectly delete rows that were not intended to be deleted.

Changes:
- Added `dataSequenceNumber` field to `IcebergDeleteFile` struct with
  default value 0 (unassigned/legacy V1). When 0, sequence number
  filtering is disabled for backward compatibility.
- Added `dataSequenceNumber` field to `HiveIcebergSplit` to carry the
  base data file's sequence number.
- Updated `IcebergSplitReader::prepareSplit()` to skip equality delete
  files when `deleteFile.dataSequenceNumber <= split.dataSequenceNumber`
  (unless either is 0, which disables the check).
- Updated test constructor of `HiveIcebergSplit` to accept the new
  `dataSequenceNumber` parameter.

Differential Revision: D97530136
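The skip condition described above can be sketched as a small predicate (hypothetical names; per the summary, an equality delete applies only when its sequence number is strictly greater than the data file's, and a value of 0 disables the check):

```python
def should_apply_equality_delete(delete_seq: int, data_seq: int) -> bool:
    # Sequence number 0 means unassigned/legacy V1: filtering is disabled
    # for backward compatibility.
    if delete_seq == 0 or data_seq == 0:
        return True
    # Equality deletes apply only to strictly OLDER data files.
    return delete_seq > data_seq

checks = [should_apply_equality_delete(d, s)
          for d, s in [(5, 3), (3, 3), (2, 3), (0, 3), (5, 0)]]
```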
Add sequence number conflict resolution for positional deletes and deletion vectors

Summary:
Extend the sequence number conflict resolution logic to positional deletes
and deletion vectors. Per the Iceberg spec:
- Positional deletes and DVs skip when deleteFileSeqNum < dataFileSeqNum
  (strictly less than, unlike equality deletes which use <=)
- Sequence number 0 (V1 legacy) never skips

Changes:
- IcebergSplitReader: Apply sequence number filtering for positional
  deletes and deletion vectors before passing to readers
- IcebergReadFile: Store dataSequenceNumber for conflict resolution
- Tests: Add sequence number conflict resolution tests for positional
  deletes and DVs

Differential Revision: D97530139
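The predicate for positional deletes and DVs differs from the equality-delete case only in the comparison operator. A hedged sketch (hypothetical names, not the actual Velox code):

```python
def should_apply_positional_delete(delete_seq: int, data_seq: int) -> bool:
    # Sequence number 0 (V1 legacy) never skips.
    if delete_seq == 0 or data_seq == 0:
        return True
    # Positional deletes and DVs also apply to data written at the SAME
    # sequence number (>=), unlike equality deletes which require a
    # strictly newer delete file (>).
    return delete_seq >= data_seq

checks = [should_apply_positional_delete(d, s)
          for d, s in [(3, 3), (2, 3), (4, 3), (0, 3)]]
```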
Summary:
- Add DeletionVectorWriter.cpp/h implementing DV file writing using RoaringBitmapArray
- Support both position-based (positional deletes) and DV-based delete file writing
- Add DeletionVectorWriterTest with comprehensive unit tests
- Update BUCK and CMakeLists.txt build targets

Differential Revision: D97530137
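A RoaringBitmapArray stores 64-bit row positions as a collection of 32-bit roaring bitmaps keyed by each position's high 32 bits. A minimal Python sketch of that partitioning scheme (plain sets stand in for the real compressed containers; the class and method names are hypothetical):

```python
class RoaringBitmapArraySketch:
    """Toy stand-in for a RoaringBitmapArray: 64-bit positions are split
    into a (high 32 bits -> set of low 32 bits) mapping."""

    def __init__(self):
        self._buckets = {}  # high 32 bits -> set of low 32-bit values

    def add(self, pos: int) -> None:
        self._buckets.setdefault(pos >> 32, set()).add(pos & 0xFFFFFFFF)

    def contains(self, pos: int) -> bool:
        return (pos & 0xFFFFFFFF) in self._buckets.get(pos >> 32, ())

dv = RoaringBitmapArraySketch()
for p in (1, 7, (1 << 32) + 7):
    dv.add(p)
```

Positions 7 and 2^32 + 7 share the same low bits but land in different buckets, which is why the key must include the high 32 bits.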
@apurva-meta
Author


DWRF is also open sourced: https://github.com/prestodb/presto-hive-dwrf

@jinchengchenghh
Collaborator

Velox does not yet fully support Iceberg Parquet, for example schema evolution. If a user adds a column, drops it, and then adds a new column with the same name, Iceberg treats them as different columns. In Gluten we fall back in this case and read via JVM Iceberg; in other words, we write native and read JVM. DWRF has no such fallback, so this change either forbids users from performing schema evolution, or leaves them unable to read the files if they switch the format to Parquet and add a column. Note that Iceberg supports multiple formats in one table.

DWRF does not have a significant advantage over Parquet, so I suggest seriously considering whether to expose this feature to users at this time.

@meta-codesync meta-codesync bot changed the title feat: [velox][iceberg] Add DWRF file format support for Iceberg data sink Add DWRF file format support for Iceberg data sink Mar 27, 2026