
Conversation

@szlta (Contributor) commented May 25, 2022

We have two different ways of generating the partition field column name:
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java#L460 generates data_bucket_16, while
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L484 generates data_bucket.

This PR tries to unify this, so that the former method (which appends the transform argument) is applied everywhere.
Related PR: 4662
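
For illustration, a minimal sketch of the two code paths (SCHEMA, table, and Expressions.bucket are stand-ins here; the resulting names match the examples discussed below):

    // Path 1: PartitionSpec.Builder derives the default name without the transform argument.
    PartitionSpec builderSpec = PartitionSpec.builderFor(SCHEMA)
        .bucket("data", 16)                         // default field name: data_bucket
        .build();

    // Path 2: spec updates go through BaseUpdatePartitionSpec, which appends the argument.
    table.updateSpec()
        .addField(Expressions.bucket("data", 16))   // default field name: data_bucket_16
        .commit();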

szlta added 2 commits May 25, 2022 15:32
Change-Id: I28e2981bcd560db178a374f26650374699258681
Change-Id: Idab3b411daeb7c69a551a93eac7a7bff5ed5d78a
Change-Id: Ia8e451246567de4f33336019b1ca21e7fe13cc5a
@szlta szlta force-pushed the uniformPartitionFieldNames branch from 8e260db to fbeea09 on May 25, 2022 21:38
  default T alwaysNull(String sourceName, int sourceId) {
    throw new UnsupportedOperationException("Void transform is not supported");
  }

Contributor:
Nit: Copy-paste error? This is not the Void transform. Should be "Always null transform is not supported" (if needed at all).


@szlta (Contributor, Author):

Nope, this was intentional. Each method here has a pair that doesn't take an int fieldId, and that counterpart was missing for the alwaysNull method.
I believe this is actually the void transform; which other method would it be, if not this one?
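
For readers following along, an illustrative sketch of the pairing pattern described above (the interface name and the identity signatures are simplified examples, not copied from the PR diff):

    interface ExampleSpecVisitor<T> {
      // existing transforms already come in pairs: one overload with a field id, one without
      T identity(String sourceName, int sourceId);

      default T identity(int fieldId, String sourceName, int sourceId) {
        return identity(sourceName, sourceId);
      }

      // the no-fieldId counterpart for the void (alwaysNull) transform was the missing piece
      default T alwaysNull(String sourceName, int sourceId) {
        throw new UnsupportedOperationException("Void transform is not supported");
      }
    }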

@kbendick (Contributor) left a comment

Hi @szlta.

I'm still ramping up on the previous discussion etc., but have we identified actual cases (e.g. when using an engine) where one might get xxx_bucket_n AND xxx_bucket? It's possible that there is a function that exists but isn't really used in a user-facing way (by which I mean used by an engine, preferably amongst the best supported / tested engines).

My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables.

Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. data_bucket_6 and data_bucket_8 within the same partition spec)?

I also have some minor style notes, but more importantly I'd like to discuss:

  • what we see in practice (as opposed to just reading through the code or even in unit tests, some of which might be somewhat old).
  • any places we might be depending on the naming at present that would cause a breaking change for people who use Iceberg in the most common ways (i.e. via Spark, Flink, Trino, Dremio, etc).

Is there a discussion somewhere of that or could you possibly provide an example so we're all on the same page?

Change-Id: I6033818787f20a46d79ca9c591b87a4710d5cd7a
@szlta (Contributor, Author) commented May 26, 2022

Hi @kbendick , thanks a lot for taking a look.

Yes, you're right, this issue does need a discussion. I originally observed it in #4662 (comment), where one unit test kept failing and shed light on this discrepancy within the Iceberg code in how PartitionField names are generated.

My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables.

I think in practice we were already using the version where transform arguments are appended to the partition field names; this is done by BaseUpdatePartitionSpec.java. The other method of generating default names, in PartitionSpec$Builder, is mostly used by test code, with the one exception of Spark3Util. That is true within the Iceberg codebase, at least: this is a public class with public methods in iceberg-api, so we will want to be careful with it, e.g. Hive calls it too.

Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. data_bucket_6 and data_bucket_8 within the same partition spec)?

Is there a restriction against that? I think the API already allows it by specifying the target name explicitly. My change here only alters the default naming convention. I think it is a possible scenario that data_bucket_2 and data_bucket_4 are both part of the spec, since the partition structure is a nested one. It's probably not useful for the majority of scenarios, but the option is there nevertheless.
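
A rough sketch of that scenario with the target names supplied explicitly (the three-argument bucket overload used here is an assumption, not part of this PR):

    // Two bucket transforms on the same source column, distinguished by explicit target names.
    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA)
        .bucket("data", 2, "data_bucket_2")
        .bucket("data", 4, "data_bucket_4")
        .build();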

About your question on how the integrating engines would be affected: I'd like contributors for each engine to chime in, as I can only speak for Hive. In Hive we currently generate the "arg-less" version of the partition field name during table creation. On the other hand, if we alter the spec later, the "arg-ful" version kicks in. E.g.:

create external table ptest (a string, b int) partitioned by spec (bucket(16, a)) stored by iceberg;
alter table ptest set partition spec (bucket(16, a), bucket(2, b));
describe default.ptest.partitions;
+---------------+--------------------------------------+----------+
|   col_name    |              data_type               | comment  |
+---------------+--------------------------------------+----------+
| partition     | struct<a_bucket:int,b_bucket_2:int>  |          |
| record_count  | bigint                               |          |
| file_count    | int                                  |          |
| spec_id       | int                                  |          |
+---------------+--------------------------------------+----------+

This doesn't look too nice, does it? But I think we're still early enough to unify these names in Hive.

Perhaps we could also annotate the current PartitionSpec$Builder as deprecated and create a new version of it that already uses the unified implementation, then hook the unit tests and everything else within the Iceberg codebase up to that. We could then give it 1-2 releases' worth of time before removing the original implementation, so at least we won't cause an immediate backward-compatibility problem.

Do let me know your thoughts.

@rdblue (Contributor) commented May 26, 2022

I don't see much benefit to making this change. What is the underlying problem you're trying to solve? Changing 40 files just to make a slight change to field names in metadata tables doesn't seem worth it.

@szlta (Contributor, Author) commented May 31, 2022

Thanks for taking a look @rdblue.

This is just a follow-up on the issue I found while implementing #4662. There I had a failing test which I could have simply amended, but I thought it was worth taking a deeper look and doing some cleanup.
We currently have two separate naming conventions for partition fields, which I think is not only technical debt but can also be confusing.
If you consider the following example:

    PartitionSpec initialSpec = PartitionSpec.builderFor(SCHEMA).bucket("data", 8).build();
    TestTables.TestTable table = TestTables.create(tableDir, "testnames", SCHEMA, initialSpec, 2);
    table.updateSpec().removeField(bucket("data", 8)).commit();
    table.updateSpec().addField(bucket("data", 8)).commit();
    Partitioning.partitionType(table);

We'd end up with

struct<1000: data_bucket: optional int, 1001: data_bucket_8: optional int>

I guess that means that for metadata queries one has to specify different column names even though they mean the same thing? I recall you also found this weird in a similar case, in @aokolnychyi's example in #3411 (comment), where field names didn't match partition names.

On the other hand you're right, of course, that this clean-up requires changes in lots of files. These are mostly test files, but I'd be more worried about the expected API compatibility / behaviour of the PartitionSpec class. Then again, maybe this is something that could be fixed before the first major version goes GA?

@szlta (Contributor, Author) commented Jun 9, 2022

Hi @rdblue, @kbendick - could you share your opinion on the above please?

@rdblue (Contributor) commented Jun 9, 2022

@szlta, I appreciate your thoroughness and that you're aiming to make this less confusing overall. I think this is probably better off the way it is. I don't think that it is very confusing to have slightly different names generated in different situations and I don't think this large of a change is worth it.

One thing that I do think warrants attention is the reuse of old partition fields. If there's a partition spec with an equivalent field, then we should bring it back rather than replacing it. That helps quite a bit more than more uniform names, I think.

@szlta (Contributor, Author) commented Jun 10, 2022

@rdblue I can accept this reasoning. I just hope we won't have to make this change in the future either, but thanks for sharing your insights.
I have amended my original change in #4662 so that this name generation discrepancy does not cause a test failure there when reusing old partition fields. Kindly take a look at that PR; I'll go ahead and close this one. Thanks!
