Conversation

@imjalpreet
Contributor

@imjalpreet imjalpreet commented Sep 13, 2024

This PR implements support for Iceberg table insertions in Velox, enabling write operations to Iceberg tables through the Hive connector.

Changes

  • Implemented IcebergDataSink class that extends HiveDataSink to handle Iceberg-specific write operations.
  • Added IcebergInsertTableHandle to manage Iceberg table insertion metadata.
  • Added IcebergPartitionField and IcebergPartitionSpec to support Iceberg's partition transform specification; this PR only supports the identity transform for now.
  • Implemented JSON serialization for partition values to support Iceberg metadata requirements.

Design Doc

Implementing_Iceberg_Insertion_Design.md

Implementation Details

The implementation follows Iceberg's table format specification, particularly for handling partitioning and metadata. Key components include:

  1. IcebergPartitionSpec: Manages partition specifications with support for different transform types.
  2. IcebergDataSink: Extends HiveDataSink to handle Iceberg-specific write operations.
  3. Extra parameters added to HiveDataSink for dataChannels.
  4. HiveDataSink data members and method visibility adjusted so IcebergDataSink can use them.

The PR also includes test infrastructure for validating Iceberg insertions with various partition strategies.

Testing

Added unit tests that verify:

  • Basic table writes to non-partitioned tables
  • Partitioned table writes with single column partition and multiple column partition

Limitation

This PR only supports Iceberg's identity partition transform, and only primitive column types as partition columns; nested column types such as struct are not supported.

All tests pass on the current codebase.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 13, 2024

@Yuhta Yuhta requested a review from xiaoxmeng September 16, 2024 18:07
@imjalpreet imjalpreet force-pushed the icebergWriter branch 2 times, most recently from f3565a6 to e3aa199 Compare November 14, 2024 23:18
@zhouyuan
Collaborator

FWIW, there's a new project to add C++ support for iceberg:
apache/iceberg-cpp#2

@imjalpreet imjalpreet marked this pull request as ready for review January 2, 2025 15:37
@imjalpreet imjalpreet marked this pull request as draft January 2, 2025 15:42
@imjalpreet imjalpreet changed the title Add support for writing iceberg tables feat(iceberg): Add support for writing iceberg tables Jan 2, 2025
@prestodb-ci

@majetideepak imported this issue into IBM GitHub Enterprise

Contributor

@yingsu00 yingsu00 left a comment


@imjalpreet Did you remove the tests?

@imjalpreet imjalpreet force-pushed the icebergWriter branch 3 times, most recently from 2c9e448 to 2f1d227 Compare February 14, 2025 02:37
Contributor

@yingsu00 yingsu00 left a comment


@imjalpreet Can you please make the test file format agnostic? You can use ReaderFactory::createReader() to create the readers instead of directly calling into the Parquet reader constructor.

Collaborator

@PingLiuPing PingLiuPing left a comment


When I create an Iceberg table based on your code, I hit the following error. Is the TIME data type not supported?

presto:iceberg> insert into partition_t2 values (TIMESTAMP '2025-02-28 14:00:02', 11, DATE '2024-02-28', TIME '14:00:33', 128);
Query 20250303_133535_00004_gh5dp failed: inferredType Failed to parse type [time]. Type not registered.

My create table DDL is:

create table partition_t2 (c_timestamp timestamp, c_int int, c_date date, c_time time, c_bigint bigint) with (format='PARQUET', partitioning=ARRAY['year(c_date)']);

@imjalpreet
Contributor Author

When I create an Iceberg table based on your code, I hit the following error. Is the TIME data type not supported?

@PingLiuPing Velox does not yet support the TIME data type, so support for it must be added first.

Types supported in Velox: https://facebookincubator.github.io/velox/develop/types.html

@jinchengchenghh
Collaborator

It should be done downstream; if you have this restriction, you need to validate it and describe what input is valid in the API comments.

jinchengchenghh added a commit to apache/incubator-gluten that referenced this pull request Jul 23, 2025
Based on PR facebookincubator/velox#10996, which is merged into ibm/velox but lacks the metadata, so read performance is not as expected. Use the flag --enable_enhanced_features to enable this feature; it is disabled by default.
Use the org.apache.gluten.tags.EnhancedFeaturesTest test tag on the enhanced-features tests so they can be excluded; they are excluded by default via the exclude-tests profile. We cannot use a JNI call to decide whether to run the tests, because the library is not loaded when the tests are listed.

Only Spark 3.4 is supported; Spark 3.5's Iceberg version 1.5.0 is not supported.

Supports the Parquet format only, because Avro and ORC writes are not supported in Velox.

Complex data type writes fall back because the metrics do not support them.
@PingLiuPing
Collaborator

It should be done downstream; if you have this restriction, you need to validate it and describe what input is valid in the API comments.

Let's clarify: did the insertion fail or the read? If the insertion failed, can you please check whether the data and column names are passed into Velox correctly, e.g. in IcebergDataSink::appendData? If the read failed, are you able to read the same data using Spark?

@jinchengchenghh
Collaborator

The read failed. It is a unit test in Gluten; before native write, it passed.

@PingLiuPing
Collaborator

The read failed. It is a unit test in Gluten; before native write, it passed.

Can you query the data written by Gluten from Spark?
I cannot reproduce this error from Prestissimo.

@jinchengchenghh
Collaborator

Yes, Gluten without native write can query it.
I have made this case fall back; let's focus on statistics collection first.

@PingLiuPing
Collaborator

Yes, Gluten without native write can query it. I have made this case fall back; let's focus on statistics collection first.

Ok, what I mean is: inserting data via C++ generates a data file. Are you then able to query that data file from Java (Spark)?
Just want to figure out which component causes this error.

@PingLiuPing
Collaborator

@Yuhta can you help review again? Thank you very much in advance.

@PingLiuPing
Collaborator

The code change in Prestissimo cannot be merged until the Velox PR is merged. To prevent breaking the build in other CI pipelines, revert the CMake target name change in hive/iceberg/CMakeLists.txt first.

@jinchengchenghh
Collaborator

Hi, @Yuhta @mbasmanova Could you help review this PR? It has been integrated with Gluten and Presto and passes several unit tests back-ported from Apache Iceberg (apache/incubator-gluten#9397). I fall back the partitioned-table case because this PR only supports identity partitioning and lacks metadata, which causes some tests to fail. Follow-up PRs will add all partition transforms, functions, and metadata; after that, I will enable all the tests.

Looking forward to your reply, many thanks!

@jinchengchenghh
Collaborator

Could you split this PR to support unpartitioned tables only?

@PingLiuPing
Collaborator

Could you split this PR to support unpartitioned tables only?

Thank you, this is possible. I'm also thinking of separating the partition name and the commit message.

@PingLiuPing PingLiuPing changed the title feat(iceberg): Add support for writing iceberg tables feat(iceberg): Add support for writing iceberg tables (to be closed) Sep 4, 2025
@PingLiuPing
Collaborator

@mbasmanova @jinchengchenghh Thank you very much for the review comments. I split the PR into multiple smaller PRs; the first one is #14723. In follow-up PRs, I will add:

  1. Iceberg partition spec (new files).
  2. Add new option for Timestamp::tmToStringView to meet Iceberg spec (completely standalone code).
  3. Customise iceberg writer options.
  4. Iceberg file name generator.
  5. Iceberg partition ID generator (new files).
  6. Identity partition transform.

meta-codesync bot pushed a commit that referenced this pull request Oct 3, 2025
Summary:
As per review requests on #10996, we split it into multiple smaller PRs.

This is the first PR of them. It adds support for basic insertion without partition transforms or data file statistics. The implementation supports both primitive data types and nested types, including struct, map, and array.

A series of follow-up PRs will extend this work to make Iceberg write fully compatible with the Iceberg specification.

1. Iceberg partition spec (new files).
2. Add new option for Timestamp::tmToStringView to meet Iceberg spec (completely standalone code).
3. Customise iceberg writer options.
4. Iceberg file name generator.
5. Iceberg partition ID generator (new files).
6. Identity partition transform.

Pull Request resolved: #14723

Reviewed By: xiaoxmeng

Differential Revision: D83667667

Pulled By: kgpai

fbshipit-source-id: a3df1aeac432d6dde6610b9a7057e170a706de9e