Uniform partition field names generation #4868
Change-Id: I28e2981bcd560db178a374f26650374699258681
Change-Id: Idab3b411daeb7c69a551a93eac7a7bff5ed5d78a
Change-Id: Ia8e451246567de4f33336019b1ca21e7fe13cc5a
8e260db to fbeea09
  }

  default T alwaysNull(String sourceName, int sourceId) {
    throw new UnsupportedOperationException("Void transform is not supported");
Nit: Copy-paste error? This is not the Void transform. It should be "Always null transform is not supported" (if needed at all).
Nope, this was intentional. Each method here has a pair that doesn't take an int fieldId, and this was missing for the alwaysNull method.
I believe this is actually the void transform. Which other method should it be, if not this one?
Hi @szlta.
I'm still ramping up on previous discussion etc, but have we identified actual cases (e.g. when using an engine) that one might get xxxx_bucket_n AND xxx_bucket? It's possible that there is a function that exists but isn't really used in a user-facing way (by which I mean used by an engine, preferably amongst the best supported / tested engines).
My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables.
Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. bucket_data_6 and bucket_data_8 within the same partition spec)?
I also have some minor style notes, but more importantly I'd like to discuss:
- what we see in practice (as opposed to just reading through the code or even in unit tests, some of which might be somewhat old).
- any places we might be depending on the naming at present that would cause a breaking change for people who use Iceberg in the most common ways (i.e. via Spark, Flink, Trino, Dremio, etc).
Is there a discussion somewhere of that or could you possibly provide an example so we're all on the same page?
Change-Id: I6033818787f20a46d79ca9c591b87a4710d5cd7a
Hi @kbendick, thanks a lot for taking a look. Yes, you're right, this issue does need a discussion. Originally I observed this in #4662 (comment), where one unit test kept failing and shed light on this discrepancy within the Iceberg code in how PartitionField names are generated.
I think in practice we were already using the version where transform arguments were appended to the partition field names. This is done by
Is there a limitation that prevents this? I think the API already allows it by specifying the target name. My change here alters the default naming convention. I think it is a possible scenario that data_bucket_2 and data_bucket_4 are both part of the spec, as the structure will be a nested one. It's probably not useful for the majority of scenarios, but nevertheless the option is there.

As for your question on how the integrating engines would be affected, I'd like contributors for each engine to chime in. I can only speak for Hive. In Hive we currently generate the "arg-less" version of the partition name during table creation. On the other hand, if we alter the spec later, then the "arg-ful" version will kick in. Doesn't look too nice, does it? But I think we're still early enough to unify these names in Hive.

Perhaps we could also annotate the current PartitionSpec$Builder as deprecated and create a new version of it that already has the unified implementation, and hook unit tests and everything else within the Iceberg codebase to that. Then we can allow one or two releases before we remove the original implementation, so at least we won't cause an immediate backward incompatibility problem.

Do let me know your thoughts.
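To make the data_bucket_2 / data_bucket_4 scenario concrete, here is a minimal, self-contained sketch. It does not call the Iceberg API: the class and the `namesAreUnique` helper are hypothetical, and the two naming patterns are simply string templates modelled on the names mentioned in this thread. It assumes that partition field names must be unique within a spec.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketNameCollisionSketch {

    // Hypothetical helper (not an Iceberg method): builds a default name for each
    // bucket transform on the same source column and reports whether all of the
    // generated names are distinct.
    static boolean namesAreUnique(String sourceName, int[] bucketCounts, boolean appendArg) {
        Set<String> names = new HashSet<>();
        for (int numBuckets : bucketCounts) {
            String name = appendArg
                ? sourceName + "_bucket_" + numBuckets   // "arg-ful" style
                : sourceName + "_bucket";                // "arg-less" style
            if (!names.add(name)) {
                return false; // the same name was generated twice
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] bucketCounts = {2, 4};
        System.out.println("arg-ful unique:  " + namesAreUnique("data", bucketCounts, true));
        System.out.println("arg-less unique: " + namesAreUnique("data", bucketCounts, false));
    }
}
```

Under these assumptions, the arg-ful defaults stay distinct for two bucket transforms on the same column, whereas the arg-less defaults would coincide unless an explicit target name is given for at least one of them.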
I don't see much benefit to making this change. What is the underlying problem you're trying to solve? Changing 40 files just to make a slight change to field names in metadata tables doesn't seem worth it.
Thanks for taking a look, @rdblue. This is just a follow-up on the issue I found while implementing #4662. There I had a failing test which I could have just amended, but I thought it was probably worth taking a deeper look and doing some cleaning up.

We'd end up with

I guess that means that for metadata queries one should specify different column names while they actually mean the same thing? I recall you also found it weird in a similar case in @aokolnychyi's example #3411 (comment), where field names didn't match partition names.

On the other hand, you're right of course, as this clean-up needs changes in lots of files. These are mostly test files, but I'd be more worried about the expected API compatibility / behaviour of the PartitionSpec class. But then again, maybe this is something that could be fixed before the first major version is GA?
@szlta, I appreciate your thoroughness and that you're aiming to make this less confusing overall. I think this is probably better off the way it is. I don't think that it is very confusing to have slightly different names generated in different situations and I don't think this large of a change is worth it. One thing that I do think warrants attention is the reuse of old partition fields. If there's a partition spec with an equivalent field, then we should bring it back rather than replacing it. That helps quite a bit more than more uniform names, I think.
@rdblue I can accept this reasoning. I just hope we won't have to do this change in the future either, but thanks for sharing your insights.
We have two different ways of generating the partition field column name:
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java#L460 generates data_bucket_16, while
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L484 generates data_bucket.
This PR tries to unify this, so that the former method would be applicable everywhere.
Related PR: 4662
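
For readers skimming the description, the two naming conventions can be sketched as follows. This is only an illustration: the class and both helper methods are hypothetical stand-ins written for this example, not the code at the two links above, and they model only the resulting names (data_bucket versus data_bucket_16).

```java
public class PartitionNameSketch {

    // Hypothetical stand-in for the "arg-less" default: the transform argument
    // is not part of the generated name.
    static String argLessBucketName(String sourceName) {
        return sourceName + "_bucket";
    }

    // Hypothetical stand-in for the "arg-ful" default: the number of buckets
    // is appended to the generated name.
    static String argFulBucketName(String sourceName, int numBuckets) {
        return sourceName + "_bucket_" + numBuckets;
    }

    public static void main(String[] args) {
        // Same source column and same transform, but two different default names.
        System.out.println(argLessBucketName("data"));
        System.out.println(argFulBucketName("data", 16));
    }
}
```

The point of the sketch is only that the same logical field can surface under two names depending on which code path created it, which is the discrepancy this PR set out to remove.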