[feat][cp] support iceberg by guhaiyan0221 · Pull Request #488 · bytedance/bolt

guhaiyan0221 · 2026-04-07T14:03:50Z

What problem does this PR solve?

Issue Number: close #191

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

Support iceberg connector and iceberg functions

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Click to view Benchmark Results

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

Click to view Breaking Changes

Breaking Changes:
- Description of the breaking change.
- Possible solutions or workarounds.
- Any other relevant information.

yingsu00 · 2026-04-08T07:52:13Z

@guhaiyan0221

As the original author of the Velox Iceberg code, I’d strongly recommend not porting that implementation directly into Bolt.

The current Iceberg implementation in Velox lives inside the Hive connector, but that was never the intended design. It was a compromise at the time to get the code merged, and in hindsight it’s a clear anti-pattern. Over time, the situation has only worsened as more Hive-specific assumptions were added and shared across both Hive and Iceberg paths.

There are several concrete issues with the current design:

Incorrect abstraction via inheritance

IcebergColumnHandle inherits from HiveColumnHandle, but their semantics diverge. For example:

HiveColumnHandle::ColumnType:
- kPartitionKey
- kRegular
- kSynthesized
- kRowIndex
- kRowId

kRowIndex and kRowId do not have the same meaning in Iceberg. Despite that, Iceberg is forced into Hive’s abstraction, which leads to:

- incorrect modeling of Iceberg concepts
- leaking Hive semantics into Iceberg
- hacks in execution paths (e.g. ScanSpec handling)

Lack of connector isolation

Iceberg is not a standalone connector, so:

- it does not have its own configuration surface
- it cannot evolve independently
- changes to Hive risk breaking Iceberg (and vice versa)

This is fundamentally limiting, especially since Iceberg has many connector-specific behaviors and configs.
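
To make the inheritance issue above concrete, here is a minimal C++ sketch. It is not the Velox or Bolt code: both classes are reduced to a name and a type, and `isHiveSyntheticRowColumn` is a hypothetical helper standing in for any Hive-side code path that branches on the column type. Only the enum values and the inheritance relationship are taken from this discussion.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified stand-in for the Hive column handle; the real class carries more state.
class HiveColumnHandle {
 public:
  enum class ColumnType { kPartitionKey, kRegular, kSynthesized, kRowIndex, kRowId };

  HiveColumnHandle(std::string name, ColumnType type)
      : name_(std::move(name)), type_(type) {
    // This sketch assumes column names are never empty.
    assert(!name_.empty());
  }
  virtual ~HiveColumnHandle() = default;

  const std::string& name() const { return name_; }
  ColumnType columnType() const { return type_; }

 private:
  std::string name_;
  ColumnType type_;
};

// The pattern being criticized: Iceberg reuses Hive's handle and enum wholesale.
class IcebergColumnHandle : public HiveColumnHandle {
 public:
  using HiveColumnHandle::HiveColumnHandle;
};

// Hypothetical Hive-side helper. It accepts any HiveColumnHandle, so it applies
// Hive's reading of kRowIndex/kRowId to an Iceberg column without any warning.
inline bool isHiveSyntheticRowColumn(const HiveColumnHandle& handle) {
  using ColumnType = HiveColumnHandle::ColumnType;
  return handle.columnType() == ColumnType::kRowIndex ||
         handle.columnType() == ColumnType::kRowId;
}
```

Because `IcebergColumnHandle` is-a `HiveColumnHandle`, passing an Iceberg handle tagged `kRowId` to `isHiveSyntheticRowColumn` compiles and returns true, and nothing in the type system flags that the Hive interpretation may not apply.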

I’ve been working on a plan to introduce Iceberg the right way, with proper separation and extensibility. Please see #107

The first step is refactoring the connector architecture to remove Hive coupling. This work is already in progress with @ZacBlanco:

Relevant PRs:

- #156
- #251
- #361
- #397
- #484

Recommendation

I strongly recommend not merging a direct port of the Velox Iceberg code at this stage. Doing so would:

- reintroduce the same structural issues into Bolt
- tightly couple Iceberg with Hive again
- make future cleanup significantly more expensive

Instead, once the connector refactor is complete, we can:

- introduce Iceberg as a standalone connector
- define clean abstractions (TableHandle, ColumnHandle, Partitioning, etc.)
- avoid inheriting incorrect Hive semantics
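
One possible shape for this separation is sketched below in C++. Every name that is not mentioned in this thread (`IcebergColumnType` and its values, `fieldId`, `asHive`) is hypothetical and chosen for illustration only; this is not the Bolt or Velox API, and the design that comes out of the connector refactor may look different. The sketch only shows the two handles as siblings under a connector-agnostic base, so that Hive's enum no longer reaches Iceberg.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical connector-agnostic base: no Hive- or Iceberg-specific members.
class ColumnHandle {
 public:
  explicit ColumnHandle(std::string name) : name_(std::move(name)) {
    // This sketch assumes column names are never empty.
    assert(!name_.empty());
  }
  virtual ~ColumnHandle() = default;
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};

// Hive keeps its own classification, private to the Hive connector.
class HiveColumnHandle : public ColumnHandle {
 public:
  enum class ColumnType { kPartitionKey, kRegular, kSynthesized, kRowIndex, kRowId };
  HiveColumnHandle(std::string name, ColumnType type)
      : ColumnHandle(std::move(name)), type_(type) {}
  ColumnType columnType() const { return type_; }

 private:
  ColumnType type_;
};

// Placeholder Iceberg-side classification; the values are illustrative only.
enum class IcebergColumnType { kData, kPartition, kMetadata };

// Iceberg defines its own handle with its own fields instead of inheriting Hive's.
class IcebergColumnHandle : public ColumnHandle {
 public:
  IcebergColumnHandle(std::string name, IcebergColumnType type, std::int32_t fieldId)
      : ColumnHandle(std::move(name)), type_(type), fieldId_(fieldId) {}
  IcebergColumnType columnType() const { return type_; }
  std::int32_t fieldId() const { return fieldId_; }

 private:
  IcebergColumnType type_;
  std::int32_t fieldId_;
};

// Compile-time check that the Hive coupling is gone in this sketch.
static_assert(!std::is_base_of<HiveColumnHandle, IcebergColumnHandle>::value,
              "Iceberg handles must not derive from Hive handles");

// Hive-side code has to opt in explicitly; an Iceberg handle yields nullptr.
inline const HiveColumnHandle* asHive(const ColumnHandle& handle) {
  return dynamic_cast<const HiveColumnHandle*>(&handle);
}
```

With this layout, Hive-only logic cannot be applied to an Iceberg column by accident: `asHive` returns `nullptr` for an `IcebergColumnHandle`, and the `static_assert` fails the build if someone reintroduces the inheritance.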

This will save us substantial rework and give us a much cleaner foundation going forward.

cc @frankobe @FelixYBW

If you want, I'm willing to move straight to the second step: extracting the common code paths from Hive and introducing the Iceberg connector. I already had the code for this last year.

guhaiyan0221 · 2026-04-09T02:22:57Z

Thanks for the very detailed explanation! I fully agree with your analysis and suggestion. I should avoid directly porting the existing Iceberg implementation and instead wait for the connector refactoring to introduce Iceberg properly as a standalone connector. Really appreciate your guidance!

guhaiyan0221 marked this pull request as draft April 7, 2026 14:04

guhaiyan0221 added 2 commits April 8, 2026 11:52

refactor: extract hive/parquet groundwork for iceberg

4a6879d

feat: add iceberg connector support

37c084d

guhaiyan0221 force-pushed the fix_cp_iceberg branch from 880340f to 37c084d on April 8, 2026 03:52

refactor: align bolt iceberg

1c2d86d

guhaiyan0221 force-pushed the fix_cp_iceberg branch from 7f79bda to 1c2d86d on April 8, 2026 13:02

guhaiyan0221 wants to merge 3 commits into bytedance:main from guhaiyan0221:fix_cp_iceberg