-
Notifications
You must be signed in to change notification settings - Fork 373
[AMORO-3796] Fixed UUID type in Iceberg tables with partitions & buckets #3797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
implement automatic replacement of UUID type columns with the FixedType[16]
xxubai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this issue.
The UUID records in the data file are persisted as binary. If I understand correctly, the reader cannot directly cast binary to UUID.
You converted the UUID type to a fixed-length 16-byte binary in the schema, which makes sense. But I wonder if there is a more elegant way to fix this, ideally without intrusive changes to the writer code.
|
Hello! Thanks for your answer; I appreciate it a lot! About the issues you raised: I was trying to find any other part of the code that can be fixed to resolve this issue; however, it is directly connected to the original Iceberg library, for example: It is not fixed yet, but in my opinion, it is necessary to give users the ability to have UUID as a partition key even now because in some way it can be more optimised than other types. Additionally, we can see a similar issue with the UUID in the libraries that use Iceberg:
I am open to any other possibility to fix this issue in the Amoro code; however, for now it seems like this dirty way of casting classes is the single possibility to resolve the issue on the Amoro code side |
|
Hi, maybe we should try to convert the byte array to UUID in https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/InternalRecordWrapper.java#L50. |
|
@zhoujinsong Hello! Yep, it is necessary to fix it in the However, it will be great if there exists a process that will keep my changes at least while a new Iceberg version with corrected code will be included in the |
Hi @CatOrLeader . Thanks for the contribution! I directly tried modifying InternalRecordWrapper to add the corresponding converter in order to fix this issue (https://github.com/apache/iceberg/pull/14208/files#diff-3be4a9b849bc3dad71d9224f6d8f1b70ac5d6e36db9a1e608273c5df3e127a7c). I think we could follow @zhoujinsong 's suggestion and create a copy class inside Amoro to replace the current InternalRecordWrapper(Such as |
|
@xxubai Hi! I agree with you and @zhoujinsong that this method is more elegant and should be applied. Thank you for the expertise! I will reflect the conclusions in the code and tag you again when it will be ready and tested in out environment. Is it ok? Also, thank you for contributing to the |
add more elegant way to handle the issue with the Iceberg tables and UUID type
add a little change to `converter` method to always return an instance of `IcebergRecordWrapper` class
xxubai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3797 +/- ##
============================================
+ Coverage 21.96% 21.99% +0.02%
- Complexity 2431 2442 +11
============================================
Files 441 442 +1
Lines 40790 40826 +36
Branches 5756 5758 +2
============================================
+ Hits 8961 8981 +20
- Misses 31078 31092 +14
- Partials 751 753 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks for the work! @CatOrLeader |
implement automatic replacement of UUID type columns with the FixedType[16]
Why are the changes needed?
There is an issue with Amoro's attempt to optimise tables in Iceberg when using partitions or buckets based on a UUID key. These changes FIX #3796 and do not corrupt any data
Brief change log
How was this patch tested?
The patch was tested manually in the production environment of the company I currently work in – the new. jar file of amoro-format-iceberg (with the changes) replace the original one
Optimisers work correctly, and table files are optimised successfully.
Documentation