feat: Support Iceberg clustered writer #14670
Conversation
Co-authored-by [email protected]
✅ Deploy Preview for meta-velox canceled.
|
65e8480 to
dc64f0a
Compare
dc64f0a to
298c0c4
Compare
Co-authored-by: Chengcheng Jin <[email protected]>
Add a Protobuf message IcebergPartitionField to carry the Iceberg field id information, and IcebergPartitionSpec to carry partition information. Build with tests and benchmarks in CI and fix the IcebergWriteTest build. Set the file format to ORC to bypass the native Parquet writer for the partitioned TPC-H Iceberg suite; once facebookincubator/velox#14670, which supports the fanout=false mode, is merged, we can relax this restriction. Relevant PR: facebookincubator/velox#13874
```cpp
std::unique_ptr<DataFileStatsCollector> icebergStatsCollector_;

// Below are structures for clustered mode writer.
const bool fanoutEnabled_;
```
Move the const member up front.
@PingLiuPing Can you add more details on the iceberg clustered writes?
In the current implementation, who handles the clustering and partitioning?
@majetideepak, thanks for the comment.
The caller is responsible for guaranteeing that the input data is partitioned first. For example, take the identity partition transform (the same as Hive's) and suppose the partition column type is integer. A valid input stream is 1,1,1,2,2,2,3,3,3. If the input stream is 1,1,1,2,2,1, clustered mode will report an error.
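The check described above can be sketched as follows. This is a minimal, hypothetical illustration of the clustered-input invariant (the names here are not the actual Velox API): once a partition key's run has been closed by the arrival of a different key, seeing that key again means the input was not clustered, and the writer must fail.

```cpp
#include <stdexcept>
#include <unordered_set>

// Hypothetical sketch of the clustered-mode input check: rows for each
// partition key must arrive as one contiguous run.
class ClusteredKeyChecker {
 public:
  // Accepts 'key' if the stream remains clustered; throws otherwise.
  void check(int key) {
    if (hasCurrent_ && key == currentKey_) {
      return; // Same run continues.
    }
    if (seenKeys_.count(key)) {
      // A previously closed run reappeared: the input is not clustered.
      throw std::runtime_error("Input is not clustered on partition key");
    }
    if (hasCurrent_) {
      seenKeys_.insert(currentKey_); // Close the previous run.
    }
    currentKey_ = key;
    hasCurrent_ = true;
  }

 private:
  bool hasCurrent_{false};
  int currentKey_{0};
  std::unordered_set<int> seenKeys_;
};
```

With this sketch, the stream 1,1,1,2,2,2,3,3,3 passes, while 1,1,1,2,2,1 throws on the final 1.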
I now see that this PR corresponds to the last commit and the other commits are part of other PRs.
```cpp
namespace facebook::velox::connector::hive::iceberg {
// ...
struct IcebergNestedField {
```
ColumnIdentity. Both Presto and Trino use this name.
@yingsu00 Thanks for the comments. This PR contains 7 commits, but only the last commit is for the clustered writer. I had to do this to pass CI because it depends on the previous commits, which are not merged yet.
Sorry for the confusion.
```cpp
std::function<IcebergDataFileStatsSettings(
    const IcebergNestedField&, const TypePtr&, bool)>
    buildNestedField = [&](const IcebergNestedField& f,
```
Would it be possible to make buildNestedField a private function instead of a lambda here? This is hard to read.
```cpp
                           bool skipBounds) -> IcebergDataFileStatsSettings {
  VELOX_CHECK_NOT_NULL(type, "Input column type cannot be null.");
  bool currentSkipBounds = skipBounds || type->isMap() || type->isArray();
  IcebergDataFileStatsSettings field(f.id, currentSkipBounds);
```
The name "field" is confusing. Can we just name it statsSettings?
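One possible shape for the refactor the reviewers suggest — lifting the lambda into a private member function and renaming `field` to `statsSettings` — is sketched below with simplified stand-in types. The real Velox types (`TypePtr`, `IcebergNestedField`, `IcebergDataFileStatsSettings`) are richer than these placeholders, so this is only an illustration of the structure, not the actual implementation.

```cpp
#include <memory>
#include <stdexcept>

// Simplified stand-ins for the real Velox/Iceberg types; names and shapes
// here are hypothetical and only illustrate the suggested refactor.
struct Type {
  enum Kind { kPrimitive, kMap, kArray };
  explicit Type(Kind k) : kind(k) {}
  Kind kind;
  bool isMap() const { return kind == kMap; }
  bool isArray() const { return kind == kArray; }
};
using TypePtr = std::shared_ptr<const Type>;

struct IcebergNestedField {
  int id;
};

struct IcebergDataFileStatsSettings {
  int id;
  bool skipBounds;
};

class IcebergStatsBuilder {
 public:
  IcebergDataFileStatsSettings build(
      const IcebergNestedField& f, const TypePtr& type) {
    return buildNestedField(f, type, /*skipBounds=*/false);
  }

 private:
  // The lambda from the diff, lifted into a private member function, with
  // the result variable renamed to 'statsSettings' as suggested.
  IcebergDataFileStatsSettings buildNestedField(
      const IcebergNestedField& f, const TypePtr& type, bool skipBounds) {
    // Stand-in for VELOX_CHECK_NOT_NULL in the real code.
    if (type == nullptr) {
      throw std::invalid_argument("Input column type cannot be null.");
    }
    // Maps and arrays carry no useful min/max bounds, so skip them.
    bool currentSkipBounds = skipBounds || type->isMap() || type->isArray();
    IcebergDataFileStatsSettings statsSettings{f.id, currentSkipBounds};
    return statsSettings;
  }
};
```

A member function also makes the recursion over nested children (elided here) easier to follow than a self-referential `std::function` lambda.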
@yingsu00 Thanks for the comments. I will address them when I split these huge PRs into smaller pieces.
Iceberg supports two writer modes: fanout (the default) and clustered.
This PR implements the clustered writer mode.
In clustered mode, the input data is assumed to be clustered/partitioned beforehand.
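The difference between the two modes can be sketched as follows. This is a hypothetical, toy-level contrast (not the Velox implementation): a fanout writer keeps one open writer per partition seen so far and accepts rows in any order, while a clustered writer keeps a single open writer and closes it each time the partition key changes.

```cpp
#include <map>
#include <vector>

struct PartitionRun {
  int key;
  int rows;
};

// Fanout mode: rows may arrive in any order; one bucket (open writer)
// per distinct partition key, all kept open until the end.
std::map<int, int> fanoutWrite(const std::vector<int>& keys) {
  std::map<int, int> rowsPerPartition;
  for (int k : keys) {
    ++rowsPerPartition[k]; // The writer for 'k' stays open throughout.
  }
  return rowsPerPartition;
}

// Clustered mode: only one writer open at a time; a key change closes
// the current run and opens a new one.
std::vector<PartitionRun> clusteredWrite(const std::vector<int>& keys) {
  std::vector<PartitionRun> runs;
  for (int k : keys) {
    if (runs.empty() || runs.back().key != k) {
      runs.push_back({k, 0}); // Close the previous writer, open a new one.
    }
    ++runs.back().rows;
  }
  return runs;
}
```

The trade-off this illustrates: fanout tolerates arbitrary input order at the cost of holding many writers open, while clustered mode holds at most one writer open but requires the pre-partitioned input described above.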