Skip to content

Conversation

@szlta
Copy link
Contributor

@szlta szlta commented Apr 6, 2022

As per #4292, PartitionsTable should call Partitioning.partitionType() to gather the partition struct in its schema. Also the result should show specId too, as the partition key might not be considered a unique key on its own without specId in V2 format. (E.g a partition column removed then re-added along with another column..)

This is just the initial change so that I can see what tests would need fixing and to check with community if the new schema is okay like this:

  1. spec_id
  2. partition
  3. record_count
  4. file_count

I think in other meta tables the spec_id precedes the partition key in the output, so I prepended the existing schema here with it. I'm not sure if this counts as a backward-incompatible change though. What do you think @aokolnychyi ?

@github-actions github-actions bot added the core label Apr 6, 2022
@github-actions github-actions bot added the spark label Apr 7, 2022
Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had one doubt on this change, looks good otherwise.

Yea when adding the spec_id to files table (#3015) , @rdblue had mentioned its ok to insert new field into the middle near partition , as nobody should be depending on the field order, so I am guessing its ok in this table?

private int fileCount;

Partition(StructLike key) {
this.specId = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering one thing , didn't test it myself.

If we have two partitions with different spec id but different key, it will show up with a random one. Could this case happen (ie, adding and removing a field in partition spec should give a new spec id but can still produce the same partition value?)? If so, we may need to add spec-id to key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you meant "different spec id but same key"?
Normally if you remove a field from partition spec and then re-add it, you will end up with a former spec and Iceberg will not create a new spec, just to point to the older one. In some cases though, if you re-add the column along with some other columns (forming the new spec together), then you could end up with the problem described here: #3411 (comment). That's something we have to solve too, but in another PR, as the scope of that actually covers other metadata tables too that use Partitioning.partitionType()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yea you are right, seems Iceberg re-uses the spec id if we revert the spec to a former one.

Types.NestedField.required(1, "partition", table.spec().partitionType()),
Types.NestedField.required(2, "record_count", Types.LongType.get()),
Types.NestedField.required(3, "file_count", Types.IntegerType.get())
Types.NestedField.required(1, "spec_id", Types.IntegerType.get()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine to add a new field, but you should not change the IDs of the other fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in this case spec_id will be appended to the schema then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I thought it means, we just need to keep the id of the other fields, but could append in any position? ex,

 Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
 Types.NestedField.required(1, "partition", table.spec().partitionType()),
 Types.NestedField.required(2, "record_count", Types.LongType.get()),
 Types.NestedField.required(3, "file_count", Types.IntegerType.get())

Though I think its not a big deal, and we can put it in the end.

@szlta szlta force-pushed the specIdToPartitionsTable branch from 44cf239 to db6e8a0 Compare April 11, 2022 13:31
Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks ok to me.

@szehon-ho szehon-ho merged commit b3a27c6 into apache:master Apr 12, 2022
@szehon-ho
Copy link
Member

Thanks @szlta for change and @rdblue for taking a look, let's work on the subsequent issue.

@szlta
Copy link
Contributor Author

szlta commented Apr 13, 2022

Thanks a lot for the review @szehon-ho and @rdblue

@szehon-ho
Copy link
Member

I tried to test these cases, while this fixes the PartitionsTable schema, some of the partition entries may have problems (as current spec is still used in a few other places). I made #4560 pr as a follow up (@szlta sorry I was not sure if you were working on going to work on this or something else).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants