
Conversation

@szlta (Contributor) commented May 25, 2022

We have two different ways of generating the partition field column name:
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java#L460 generates data_bucket_16, while
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L484 generates data_bucket.

This PR tries to unify this, so that the former method (which appends the transform argument) is applied everywhere.
Related PR: 4662
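
For illustration, a minimal sketch of the two code paths (SCHEMA, table, and Expressions.bucket are stand-ins here; the resulting names match the examples discussed below):

    // Path 1: PartitionSpec.Builder derives the default name without the transform argument.
    PartitionSpec builderSpec = PartitionSpec.builderFor(SCHEMA)
        .bucket("data", 16)                         // default field name: data_bucket
        .build();

    // Path 2: spec updates go through BaseUpdatePartitionSpec, which appends the argument.
    table.updateSpec()
        .addField(Expressions.bucket("data", 16))   // default field name: data_bucket_16
        .commit();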

szlta added 2 commits May 25, 2022 15:32
Change-Id: I28e2981bcd560db178a374f26650374699258681
Change-Id: Idab3b411daeb7c69a551a93eac7a7bff5ed5d78a
Change-Id: Ia8e451246567de4f33336019b1ca21e7fe13cc5a
@szlta szlta force-pushed the uniformPartitionFieldNames branch from 8e260db to fbeea09 on May 25, 2022 21:38
  default T alwaysNull(String sourceName, int sourceId) {
    throw new UnsupportedOperationException("Void transform is not supported");
  }

Contributor:
Nit: Copy-paste error? This is not the Void transform. Should be "Always null transform is not supported" (if needed at all).


@szlta (Contributor, Author):

Nope, this was intentional. Each method here has a pair that doesn't take an int fieldId, and that counterpart was missing for the alwaysNull method.
I believe this is actually the void transform; which other method would it be, if not this one?
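
For readers following along, an illustrative sketch of the pairing pattern described above (the interface name and the identity signatures are simplified examples, not copied from the PR diff):

    interface ExampleSpecVisitor<T> {
      // existing transforms already come in pairs: one overload with a field id, one without
      T identity(String sourceName, int sourceId);

      default T identity(int fieldId, String sourceName, int sourceId) {
        return identity(sourceName, sourceId);
      }

      // the no-fieldId counterpart for the void (alwaysNull) transform was the missing piece
      default T alwaysNull(String sourceName, int sourceId) {
        throw new UnsupportedOperationException("Void transform is not supported");
      }
    }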

@kbendick (Contributor) left a comment

Hi @szlta.

I'm still ramping up on the previous discussion etc., but have we identified actual cases (e.g. when using an engine) where one might get xxx_bucket_n AND xxx_bucket? It's possible that there is a function that exists but isn't really used in a user-facing way (by which I mean used by an engine, preferably amongst the best supported / tested engines).

My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables.

Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. data_bucket_6 and data_bucket_8 within the same partition spec)?

I also have some minor style notes, but more importantly I'd like to discuss:

  • what we see in practice (as opposed to just reading through the code or even in unit tests, some of which might be somewhat old).
  • any places we might be depending on the naming at present that would cause a breaking change for people who use Iceberg in the most common ways (i.e. via Spark, Flink, Trino, Dremio, etc).

Is there a discussion somewhere of that or could you possibly provide an example so we're all on the same page?

Change-Id: I6033818787f20a46d79ca9c591b87a4710d5cd7a
@szlta (Contributor, Author) commented May 26, 2022

Hi @kbendick , thanks a lot for taking a look.

Yes, you're right, this issue does need a discussion. I originally observed it in #4662 (comment), where one unit test kept failing and shed light on this discrepancy within the Iceberg code in how PartitionField names are generated.

My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables.

I think in practice we were already using the version where transform arguments are appended to the partition field names; this is done by BaseUpdatePartitionSpec.java. The other method of generating default names, in PartitionSpec$Builder, is mostly used by test code, with the one exception of Spark3Util. That is true within the Iceberg codebase, at least: this is a public class with public methods in iceberg-api, so we will want to be careful with it, e.g. Hive calls it too.

Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. data_bucket_6 and data_bucket_8 within the same partition spec)?

Is there a restriction against that? I think the API already allows it by specifying the target name explicitly. My change here only alters the default naming convention. I think it is a possible scenario that data_bucket_2 and data_bucket_4 are both part of the spec, since the partition structure is a nested one. It's probably not useful for the majority of scenarios, but the option is there nevertheless.
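
A rough sketch of that scenario with the target names supplied explicitly (the three-argument bucket overload used here is an assumption, not part of this PR):

    // Two bucket transforms on the same source column, distinguished by explicit target names.
    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA)
        .bucket("data", 2, "data_bucket_2")
        .bucket("data", 4, "data_bucket_4")
        .build();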

About your question on how the integrating engines would be affected: I'd like contributors for each engine to chime in, as I can only speak for Hive. In Hive we currently generate the "arg-less" version of the partition field name during table creation. On the other hand, if we alter the spec later, the "arg-ful" version kicks in. E.g.:

create external table ptest (a string, b int) partitioned by spec (bucket(16, a)) stored by iceberg;
alter table ptest set partition spec (bucket(16, a), bucket(2, b));
describe default.ptest.partitions;
+---------------+--------------------------------------+----------+
|   col_name    |              data_type               | comment  |
+---------------+--------------------------------------+----------+
| partition     | struct<a_bucket:int,b_bucket_2:int>  |          |
| record_count  | bigint                               |          |
| file_count    | int                                  |          |
| spec_id       | int                                  |          |
+---------------+--------------------------------------+----------+

This doesn't look too nice, does it? But I think we're still early enough to unify these names in Hive.

Perhaps we could also annotate the current PartitionSpec$Builder as deprecated and create a new version of it that already uses the unified implementation, then hook the unit tests and everything else within the Iceberg codebase up to that. We could then give it 1-2 releases' worth of time before removing the original implementation, so at least we won't cause an immediate backward-compatibility problem.

Do let me know your thoughts.

@rdblue (Contributor) commented May 26, 2022

I don't see much benefit to making this change. What is the underlying problem you're trying to solve? Changing 40 files just to make a slight change to field names in metadata tables doesn't seem worth it.

@szlta (Contributor, Author) commented May 31, 2022

Thanks for taking a look @rdblue.

This is just a follow-up on the issue I found while implementing #4662. There I had a failing test which I could have simply amended, but I thought it was worth taking a deeper look and doing some cleanup.
We currently have two separate naming conventions for partition fields, which I think is not only technical debt but can also be confusing.
If you consider the following example:

    PartitionSpec initialSpec = PartitionSpec.builderFor(SCHEMA).bucket("data", 8).build();
    TestTables.TestTable table = TestTables.create(tableDir, "testnames", SCHEMA, initialSpec, 2);
    table.updateSpec().removeField(bucket("data", 8)).commit();
    table.updateSpec().addField(bucket("data", 8)).commit();
    Partitioning.partitionType(table);

We'd end up with

struct<1000: data_bucket: optional int, 1001: data_bucket_8: optional int>

I guess that means that for metadata queries one has to specify different column names even though they mean the same thing? I recall you also found this weird in a similar case, in @aokolnychyi's example in #3411 (comment), where field names didn't match partition names.

On the other hand you're right, of course, that this clean-up requires changes in lots of files. These are mostly test files, but I'd be more worried about the expected API compatibility / behaviour of the PartitionSpec class. Then again, maybe this is something that could be fixed before the first major version goes GA?

@szlta (Contributor, Author) commented Jun 9, 2022

Hi @rdblue, @kbendick - could you share your opinion on the above please?

@rdblue (Contributor) commented Jun 9, 2022

@szlta, I appreciate your thoroughness and that you're aiming to make this less confusing overall. I think this is probably better off the way it is. I don't think that it is very confusing to have slightly different names generated in different situations and I don't think this large of a change is worth it.

One thing that I do think warrants attention is the reuse of old partition fields. If there's a partition spec with an equivalent field, then we should bring it back rather than replacing it. That helps quite a bit more than more uniform names, I think.

@szlta (Contributor, Author) commented Jun 10, 2022

@rdblue I can accept this reasoning. I just hope we won't have to make this change in the future either, but thanks for sharing your insights.
I have amended my original change in #4662 so that this name generation discrepancy does not cause a test failure there when reusing old partition fields. Kindly take a look at that PR; I'll go ahead and close this one. Thanks!
