Skip to content

[feat][cp] support iceberg#488

Draft
guhaiyan0221 wants to merge 3 commits intobytedance:mainfrom
guhaiyan0221:fix_cp_iceberg
Draft

[feat][cp] support iceberg#488
guhaiyan0221 wants to merge 3 commits intobytedance:mainfrom
guhaiyan0221:fix_cp_iceberg

Conversation

@guhaiyan0221
Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

Issue Number: close #191

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Support iceberg connector and iceberg functions

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@guhaiyan0221 guhaiyan0221 marked this pull request as draft April 7, 2026 14:04
@yingsu00
Copy link
Copy Markdown
Contributor

yingsu00 commented Apr 8, 2026

@guhaiyan0221

As the original author of the Velox Iceberg code, I’d strongly recommend not porting that implementation directly into Bolt.

The current Iceberg implementation in Velox lives inside the Hive connector, but that was never the intended design. It was a compromise at the time to get the code merged, and in hindsight it’s a clear anti-pattern. Over time, the situation has only worsened as more Hive-specific assumptions were added and shared across both Hive and Iceberg paths.

There are several concrete issues with the current design:

  1. Incorrect abstraction via inheritance
    IcebergColumnHandle inherits from HiveColumnHandle, but their semantics diverge. For example:
HiveColumnHandle::ColumnType:
- kPartitionKey
- kRegular
- kSynthesized
- kRowIndex
- kRowId

kRowIndex and kRowId do not have the same meaning in Iceberg. Despite that, Iceberg is forced into Hive’s abstraction, which leads to:

  • incorrect modeling of Iceberg concepts
  • leaking Hive semantics into Iceberg
  • hacks in execution paths (e.g. ScanSpec handling)
  • Lack of connector isolation
  1. Iceberg is not a standalone connector, so:
  • it does not have its own configuration surface
  • it cannot evolve independently
  • changes to Hive risk breaking Iceberg (and vice versa)

This is fundamentally limiting, especially since Iceberg has many connector-specific behaviors and configs.

I’ve been working on a plan to introduce Iceberg the right way, with proper separation and extensibility. Please see #107

The first step is refactoring the connector architecture to remove Hive coupling. This work is already in progress with @ZacBlanco:

Relevant PRs:

#156
#251
#361
#397
#484
Recommendation

I strongly recommend not merging a direct port of the Velox Iceberg code at this stage. If we do so it would:

  • reintroduce the same structural issues into Bolt
  • tightly couple Iceberg with Hive again
  • make future cleanup significantly more expensive

Instead, once the connector refactor is complete, we can:

  • introduce Iceberg as a standalone connector
  • define clean abstractions (TableHandle, ColumnHandle, Partitioning, etc.)
  • avoid inheriting incorrect Hive semantics

This will save us substantial rework and give us a much cleaner foundation going forward.

cc @frankobe @FelixYBW

If you want, I'm willing to rush into the second step to extract common code path from Hive and introduce the Iceberg connector. I had the code already last year.

@guhaiyan0221
Copy link
Copy Markdown
Collaborator Author

@guhaiyan0221

As the original author of the Velox Iceberg code, I’d strongly recommend not porting that implementation directly into Bolt.

The current Iceberg implementation in Velox lives inside the Hive connector, but that was never the intended design. It was a compromise at the time to get the code merged, and in hindsight it’s a clear anti-pattern. Over time, the situation has only worsened as more Hive-specific assumptions were added and shared across both Hive and Iceberg paths.

There are several concrete issues with the current design:

  1. Incorrect abstraction via inheritance
    IcebergColumnHandle inherits from HiveColumnHandle, but their semantics diverge. For example:
HiveColumnHandle::ColumnType:
- kPartitionKey
- kRegular
- kSynthesized
- kRowIndex
- kRowId

kRowIndex and kRowId do not have the same meaning in Iceberg. Despite that, Iceberg is forced into Hive’s abstraction, which leads to:

  • incorrect modeling of Iceberg concepts
  • leaking Hive semantics into Iceberg
  • hacks in execution paths (e.g. ScanSpec handling)
  • Lack of connector isolation
  1. Iceberg is not a standalone connector, so:
  • it does not have its own configuration surface
  • it cannot evolve independently
  • changes to Hive risk breaking Iceberg (and vice versa)

This is fundamentally limiting, especially since Iceberg has many connector-specific behaviors and configs.

I’ve been working on a plan to introduce Iceberg the right way, with proper separation and extensibility. Please see #107

The first step is refactoring the connector architecture to remove Hive coupling. This work is already in progress with @ZacBlanco:

Relevant PRs:

#156 #251 #361 #397 #484 Recommendation

I strongly recommend not merging a direct port of the Velox Iceberg code at this stage. If we do so it would:

  • reintroduce the same structural issues into Bolt
  • tightly couple Iceberg with Hive again
  • make future cleanup significantly more expensive

Instead, once the connector refactor is complete, we can:

  • introduce Iceberg as a standalone connector
  • define clean abstractions (TableHandle, ColumnHandle, Partitioning, etc.)
  • avoid inheriting incorrect Hive semantics

This will save us substantial rework and give us a much cleaner foundation going forward.

cc @frankobe @FelixYBW

If you want, I'm willing to rush into the second step to extract common code path from Hive and introduce the Iceberg connector. I had the code already last year.

Thanks for the very detailed explanation! I fully agree with your analysis and suggestion. I should avoid directly porting the existing Iceberg implementation and instead wait for the connector refactoring to introduce Iceberg properly as a standalone connector. Really appreciate your guidance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants