-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Document field-id requirements for partition specs #1141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
site/docs/spec.md
Outdated
|
|
||
| #### Partition Field ID handling | ||
|
|
||
| A partition field id is an integer (starting at 1000) used to identify a partition field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requirement isn't to start at 1000, it is that the IDs should not conflict with the IDs required for the other fields in data files. I think starting at 1000 is a good recommendation to avoid that.
site/docs/spec.md
Outdated
|
|
||
| A partition field id is an integer (starting at 1000) used to identify a partition field. | ||
|
|
||
| Since iceberg release 0.8.0, partition fields are present in every partition field of partition specs in a table metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec shouldn't reference Iceberg Java releases. The requirements here are for different versions of tables:
- For v1 tables, partition field metadata should include a
field-idfor each partition field, but this is not required. - For v2 tables, partition field metadata must include a
field-idfor each partition field.
Then for compatibility within v1, this should recommend not removing fields but replacing them with always null. We also need to add that transform to the spec.
site/docs/spec.md
Outdated
|
|
||
| Additionally, in table metadata format v2, partition fields are required to have unique field IDs to support partition spec evolution. | ||
|
|
||
| For tables without partition field IDs, iceberg will generate an auto-increment unique field id starting at 1000 for every partition field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should state that old versions of the "reference implementation will generate..."
I think this should also give some context at the top about what breaks if compatibility recommendations are not followed for v1.
|
Thanks @rdblue and update the PR. |
|
@rdblue can you let me know your comments? Thanks. |
|
@rdblue should we merge this? |
site/docs/spec.md
Outdated
| | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp(tz)` | `int` | | ||
| | **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp(tz)` | `date` | | ||
| | **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp(tz)` | `int` | | ||
| | **`alwaysNull`** | Always produces `null` (the void transform) | Any | `void` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transform name for the spec is void, not alwaysNull. That's just the name we use in Java where void is a reserved keyword.
Also, the result type is the source type, but values are always null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. Fixed.
site/docs/spec.md
Outdated
|
|
||
| In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec. | ||
|
|
||
| #### Partition Field ID Handling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the appendix that covers JSON serialization for partition specs. It is good to document the field-id name here and note that it is required in v2 and optional in v1, but a lot of this content should be noted in the Partitioning section of the spec.
site/docs/spec.md
Outdated
|
|
||
| A partition field ID is an integer used to identify a partition field. | ||
| These IDs should not conflict with the IDs required for the other fields in data files. | ||
| To avoid that, iceberg assigns partition field IDs starting at 1000. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requirement is that IDs are unique and do not conflict with field IDs for columns in manifest files. The reference implementation assigns partition field IDs starting at 1,000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
site/docs/spec.md
Outdated
| * For v1 tables, partition field metadata should include a field id for each partition field, but this is not required. | ||
| * For v2 tables, partition field metadata must include a field id for each partition field. Partition field IDs are unique across partition specs to support the partition spec evolution for a given table. | ||
|
|
||
| To remove partition fields from the partition spec in an existing v1 table, it is recommended not removing fields but replacing their transforms with `alwaysNull`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in a secondary section about working with v1 tables. That should explain the problem: that field IDs were reused, which causes problems in metadata tables. To avoid the problem, don't reorder or delete partition fields. Instead, add new fields at the end and replace fields with void. Also note that this is not needed fro v2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
365c7ec to
7ce502c
Compare
|
@rdblue thanks for the comments. I updated the document accordingly. Can you take another look? Thanks. |
site/docs/spec.md
Outdated
| | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp(tz)` | `int` | | ||
| | **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp(tz)` | `date` | | ||
| | **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp(tz)` | `int` | | ||
| | **`void`** | Always produces `null` (the void transform) | Any | `null` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null isn't really a type. I understand wanting to make it clear what is happening, but I think "Source type" like identity is probably more accurate. It is clear that the function result is always null, though.
site/docs/spec.md
Outdated
|
|
||
| #### Partition Field ID handling | ||
|
|
||
| A partition field ID is an integer used to identify a partition field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "... identify a partition field in Iceberg manifest files".
site/docs/spec.md
Outdated
|
|
||
| About compatibility between v1 and v2 tables: | ||
|
|
||
| * For backward compatibility, if field ids are missing in a table metadata, iceberg will sequentially generate ids for each field starting at 1000 based on its position in the list of fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not refer to what "iceberg" does. There are things required by this spec and there are implementations. Both are "iceberg".
site/docs/spec.md
Outdated
| #### Partition Field ID | ||
|
|
||
| A partition field ID is an integer used to identify a partition field. | ||
| They are unique and do not conflict with field IDs for columns in manifest files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be "must not" because it is a requirement, not a guarantee.
site/docs/spec.md
Outdated
| * For backward compatibility, if field ids are missing in a table metadata, iceberg will sequentially generate ids for each field starting at 1000 based on its position in the list of fields. | ||
| * For forward compatibility, if field ids are not supported but present in the metadata, old versions of the reference implementation will ignore those field ids and then regenerate an auto-increment field id starting at 1000 for every partition field. | ||
|
|
||
| While working with a v1 table, field IDs might be reused if removing partition fields from its partition spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what this is trying to say is that old versions of the reference implementation did not keep track of field IDs. When creating a manifest, each field of the partition spec was assigned an ID starting at 1,000, and there were no guarantees about ID reuse across files. But as long as the spec was not evolved, IDs were consistent.
That has a few implications:
- Older writers may erase partition field IDs when writing to a v1 table. This does not happen to v2 tables because writers will fail to read or write a v2 table.
- Metadata tables need consistent field IDs across manifest files. To do this for v1 tables evolve the spec according to the recommendations (don't delete, replace with void; only add to the end; renames are okay).
|
@rdblue thanks for the review. Improved the doc accordingly. |
site/docs/spec.md
Outdated
| * For backward compatibility, if field ids are missing in a table metadata, the reference implementation will sequentially generate ids for each field starting at `1000` based on its position in the list of fields. | ||
| * For forward compatibility, if field ids are not supported but present in the metadata, old versions of the reference implementation will ignore those field ids and then regenerate an auto-increment field id starting at 1000 for every partition field. | ||
|
|
||
| While working with a v1 table, old versions of the reference implementation did not keep track of field IDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The v1 spec did not require consistent field IDs, so they were assigned when creating each manifest file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue thanks for the comment. Updated it accordingly.
yyanyy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for continuing the efforts of updating this! Some minor comments on wording, maybe it's just me not being super familiar with this area
site/docs/spec.md
Outdated
| But as long as the partition spec will not be evolved, IDs will be consistent. | ||
|
|
||
| This has a few implications: | ||
| * Older writers may erase partition field IDs when writing to a v1 table. This does not happen to v2 tables because writers will fail to read or write a v2 table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "... will fail to read or write a v2 table when partition field ID doesn't exist"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Updated it for clarification. It failed because of version mismatch. So old parsers will fail to read or write a v2 table metadata.
site/docs/spec.md
Outdated
| | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp(tz)` | `int` | | ||
| | **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp(tz)` | `date` | | ||
| | **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp(tz)` | `int` | | ||
| | **`void`** | Always produces `null` (the void transform) | Any | Source type | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might be good to have an explanation here on why this is needed (not likely needed in v2, mostly needed in v1 when dropping a partition field without changing the auto-assigned partition field ids?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is to describe all available partition transforms instead of explaining their use cases.
I added a new section Void Transform Details with some details.
site/docs/spec.md
Outdated
| This has a few implications: | ||
| * Older writers may erase partition field IDs when writing to a v1 table. This does not happen to v2 tables because writers will fail to read or write a v2 table. | ||
| * Metadata tables need consistent field IDs across manifest files. To achieve it, for v1 tables, please evolve the partition spec according to the recommendations, | ||
| i.e. don't reorder or delete partition fields; replace fields with with `void` transform; add new fields to the end. Note that renames are OK and also note that this does not apply for v2 tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might want to clarify why it doesn't apply for v2 tables; it seems that it's not needed because v2 table is capable of auto identifying the latest partition field Id across manifests to always assign a new one to the new field during a partition spec update table commit? If that's the case, it might be good to mention it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it with the reason (i.e. v2 table tracks and persists the last assigned partition field id).
site/docs/spec.md
Outdated
|
|
||
| This has a few implications: | ||
| * Older writers may erase partition field IDs when writing to a v1 table. This does not happen to v2 tables because writers will fail to read or write a v2 table. | ||
| * Metadata tables need consistent field IDs across manifest files. To achieve it, for v1 tables, please evolve the partition spec according to the recommendations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I wonder if we want to expand a little bit on "Metadata tables need consistent field IDs across manifest files", which I think is to ensure correctness when looking up metadata tables, the same ID always refers to the same partition field/no ID reuses, and this mapping between ID and partition field is immutable even after partition spec evolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that is correct. New section Void Transform Details now include those details.
39a3eff to
832a5c8
Compare
For #1025