feat(iceberg): Add support for writing iceberg tables (to be closed) #10996
Conversation
FWIW, there's a new project to add C++ support for Iceberg:
@majetideepak imported this issue into IBM GitHub Enterprise
yingsu00 left a comment:
@imjalpreet Did you remove the tests?
yingsu00 left a comment:
@imjalpreet Can you please make the test file format agnostic? You can use ReaderFactory::createReader() to create the readers instead of directly calling into the Parquet reader constructor.
PingLiuPing left a comment:
When I create an Iceberg table based on your code, I hit the following error. Is the TIME data type not supported?
presto:iceberg> insert into partition_t2 values (TIMESTAMP '2025-02-28 14:00:02', 11, DATE '2024-02-28', TIME '14:00:33', 128);
Query 20250303_133535_00004_gh5dp failed: inferredType Failed to parse type [time]. Type not registered.
My create table DDL is:
create table partition_t2 (c_timestamp timestamp, c_int int, c_date date, c_time time, c_bigint bigint) with (format='PARQUET', partitioning=ARRAY['year(c_date)']);
@PingLiuPing Velox does not yet support the TIME data type, so we must first add support for it. Types supported in Velox: https://facebookincubator.github.io/velox/develop/types.html
It should be done downstream. If you have this restriction, you need to verify it and document what input is valid in the API comments.
Based on PR facebookincubator/velox#10996, which is merged into ibm/velox. It still lacks the metadata, so read performance is not as expected. Use the flag --enable_enhanced_features to enable this feature (disabled by default). Tag the enhanced-feature tests with org.apache.gluten.tags.EnhancedFeaturesTest so they can be excluded; they are excluded by default via the exclude-tests profile. We cannot use a JNI call to decide whether to run the tests, because the library is not loaded when the tests are listed. Only Spark 3.4 and Spark 3.5 are supported; Iceberg version 1.5.0 is not supported. Only the Parquet format is supported, because Avro and ORC writes are not supported in Velox. Complex data type writes fall back because the metrics are not supported.
Let's clarify: did the insert fail or the read? If the insert failed, can you please check whether the data and column names are passed into Velox correctly, e.g. in
The read failed. It is a unit test in Gluten; before native write, it passed.
Can you query this data that was written by Gluten from Spark?
Yes, Gluten without native write can query. |
OK, what I mean is: insert data using C++, which generates a data file. Are you then able to query this data file from Java (Spark)?
@Yuhta can you help review again? Thank you very much in advance.
The code change in Prestissimo will not be able to get merged until the Velox PR is merged. To prevent build breaks in other CI pipelines, revert the CMake target name change in hive/iceberg/CMakeLists.txt first.
Hi @Yuhta @mbasmanova, could you help review this PR? It has been integrated with Gluten and Presto, and passes several unit tests back-ported from Apache Iceberg (apache/incubator-gluten#9397). I fall back the partitioned-table tests because this PR only supports identity partitioning and lacks metadata, which causes some tests to fail. The following PRs will support all partition transforms, functions, and metadata; after that, I will enable all the tests. Looking forward to your reply, many thanks!
Co-authored-by: [email protected]
Could you split out unpartitioned-table support only?
Thank you, this is possible. And I'm also thinking of separating the partition name and commit message.
@mbasmanova @jinchengchenghh Thank you very much for the review comments. I split the PR into multiple smaller PRs; the first one is #14723. In the following PRs, I will add
Summary: As per the review request on #10996, we split #10996 into multiple smaller PRs. This is the first of them. It adds support for basic insertion without partition transforms or data file statistics. The implementation supports both primitive data types and nested types, including struct, map, and array. A series of follow-up PRs will extend this work to make Iceberg writes fully compatible with the Iceberg specification:
1. Iceberg partition spec (new files).
2. A new option for Timestamp::tmToStringView to meet the Iceberg spec (completely standalone code).
3. Customizable Iceberg writer options.
4. Iceberg file name generator.
5. Iceberg partition ID generator (new files).
6. Identity partition transform.
Pull Request resolved: #14723 Reviewed By: xiaoxmeng Differential Revision: D83667667 Pulled By: kgpai fbshipit-source-id: a3df1aeac432d6dde6610b9a7057e170a706de9e
This PR implements support for Iceberg table insertions in Velox, enabling write operations to Iceberg tables through the Hive connector.
Changes
- IcebergDataSink class that extends HiveDataSink to handle Iceberg-specific write operations.
- IcebergInsertTableHandle to manage Iceberg table insertion metadata.
- IcebergPartitionField and IcebergPartitionSpec to support Iceberg's partition transform specification; this PR only supports the identity transform for now.
Design Doc
Implementing_Iceberg_Insertion_Design.md
Implementation Details
The implementation follows Iceberg's table format specification, particularly for handling partitioning and metadata. Key components include:
The PR also includes test infrastructure for validating Iceberg insertions with various partition strategies.
Testing
Added unit tests that verify:
Limitation
This PR only supports the Iceberg identity partition transform. It also only supports primitive column types as partition columns; nested column types such as struct are not supported.
All tests pass on the current codebase.