Spec: Add snapshot tagging and branching #3425
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -566,6 +566,38 @@ Notes: | |
| 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan. | ||
| 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected. | ||
|
|
||
| #### Snapshot Reference | ||
|
|
||
| Iceberg tables keep track of branches and tags using snapshot references. | ||
| Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](#commit-conflict-resolution-and-retry) procedures. | ||
|
|
||
| The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](#snapshot-retention-policy). | ||
|
|
||
| | v1 | v2 | Field name | Type | Description | | ||
| | ---------- | ---------- | ---------------------------- | --------- | ----------- | | ||
| | _required_ | _required_ | **`snapshot-id`** | `long` | A reference's snapshot ID. The tagged snapshot or latest snapshot of a branch. | | ||
| | _required_ | _required_ | **`type`** | `string` | Type of the reference, `tag` or `branch` | | ||
| | _optional_ | _optional_ | **`min-snapshots-to-keep`** | `int` | For `branch` type only, a positive number for the minimum number of snapshots to keep in a branch while expiring snapshots. Defaults to table property `history.expire.min-snapshots-to-keep`. | | ||
| | _optional_ | _optional_ | **`max-snapshot-age-ms`** | `long` | For `branch` type only, a positive number for the max age of snapshots to keep when expiring, including the latest snapshot. Defaults to table property `history.expire.max-snapshot-age-ms`. | | ||
| | _optional_ | _optional_ | **`max-ref-age-ms`** | `long` | For snapshot references except the `main` branch, a positive number for the max age of the snapshot reference to keep while expiring snapshots. Defaults to table property `history.expire.max-ref-age-ms`. The `main` branch never expires. | | ||
|
|
||
| Valid snapshot references are stored as the values of the `refs` map in table metadata. For serialization, see Appendix C. | ||
|
|
||
| #### Snapshot Retention Policy | ||
|
|
||
| Table snapshots expire and are removed from metadata to allow removed or replaced data files to be physically deleted. | ||
| The snapshot expiration procedure removes snapshots from table metadata and applies the table's retention policy. | ||
| Retention policy can be configured both globally, through table properties, and per snapshot reference, through the `min-snapshots-to-keep`, `max-snapshot-age-ms` and `max-ref-age-ms` fields. | ||
|
|
||
| When expiring snapshots, retention policies in table and snapshot references are evaluated in the following way: | ||
|
|
||
| 1. Start with an empty set of snapshots to retain | ||
| 2. Remove any refs (other than main) where the referenced snapshot is older than `max-ref-age-ms` | ||
| 3. For each branch and tag, add the referenced snapshot to the retained set | ||
| 4. For each branch, add its ancestors to the retained set until: | ||
| 1. The snapshot is older than `max-snapshot-age-ms`, AND | ||
| 2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot) | ||
| 5. Expire any snapshot not in the set of snapshots to retain. | ||
|
||
|
|
||
| ### Table Metadata | ||
|
|
||
|
|
@@ -593,12 +625,13 @@ Table metadata consists of the following fields: | |
| | _optional_ | _required_ | **`default-spec-id`**| ID of the "current" spec that writers should use by default. | | ||
| | _optional_ | _required_ | **`last-partition-id`**| An integer; the highest assigned partition field ID across all partition specs for the table. This is used to ensure partition fields are always assigned an unused ID when evolving specs. | | ||
| | _optional_ | _optional_ | **`properties`**| A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, `commit.retry.num-retries` is used to control the number of commit retries. | | ||
| | _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot. | | ||
| | _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot; must be the same as the current ID of the `main` branch in `refs`. | | ||
| | _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected. | | ||
| | _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the table. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed. | | ||
| | _optional_ | _optional_ | **`metadata-log`**| A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. | | ||
| | _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. | | ||
| | _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | | ||
| | | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. | | ||
|
Contributor
My concern raised on the google doc was that the act of re-writing The optimistic retry means that at transaction commit time things are unlikely to conflict or cause issues. I think it just feels counterintuitive to me to do a re-write of the table metadata just to e.g. change the branch or add a tag. I also have concerns about extensibility in the future. For example extending to multiple tables, or implementing a One potential solution is to give the catalogs control over the implementation. For example, the HMS catalog may store the proposed branch-related data structures in the table definition in Hive rather than in the metadata.json. This allows individual catalogs more latitude in how they deal with branching, and the ability to evolve that strategy over time. It does have the downside that migrating between catalogs could be more troublesome.
Contributor
Yes.
I agree, this is a bit awkward. But it fits with the current model of keeping metadata.json as the source of truth for a table. I think we can evolve catalogs to do fancier things here -- for one, not sending all of the snapshots back to the client. But as far as the spec is concerned, how would we express that?
Contributor
Author
Maybe one potential middle ground we can settle here is to first add the tagging feature without branching. That is a feature that many users would like to have for use cases like time-based snapshot expiration, and it makes sense to have that as a part of the Iceberg spec. For the branching feature, it was added to the design more for completeness, but so far (at least to me) there is no feature request for that yet. We can add it when the feature request comes and then discuss the best way to go. What do you think? @rdblue @rymurr
Contributor
I don't think we should modify the proposal here. We know that lots of snapshots can be a problem, but that's how Iceberg manages snapshots. I don't think we need to hold off on building a feature like this for fairly rare cases. We can add branches and handle the metadata issues elsewhere. One solution I've been thinking about is allowing table implementations to request a specific time range or updates since some time in the REST catalog. That could handle this problem nicely.
Contributor
Author
Makes sense. In that case, do you have any additional concerns related to the spec change?
Contributor
I still feel uncomfortable about a transaction to add a tag, but I don't see any easy way out of it. I think it would be good to have a discussion about metadata.json as the source of truth (for everything) in the longer term, as I think that is becoming less feasible. My comment on the call today about notification settings living in metadata is related. I guess my only question is: if we agree that it's awkward and we agree that catalogs have more of a role to play in the future, then how do we move on from this proposal? It's hard to back this out of the spec or evolve it once it's in. I know this is a hard question to reason about and I don't want to hold this useful feature up. But it would be good to at least think about it given the above discussion.
Contributor
Author
Yes, I think the work in the TableMetadata builder and REST catalog is to try to have an alternative way of dealing with table metadata updates as a diff rather than rewriting the entire table metadata file. That could open the door to other implementations of the TableMetadata spec that are backed by services and do not require a full metadata rewrite. That is a development I would be happy to see, and with that we will be able to have fast commits to the metadata for small changes like tagging.
Contributor
Yeah, I think we agree that it is awkward. But it's a separate point. We already allow branching to some degree, or at least a collection of snapshots that isn't necessarily linear. Being able to label some of those snapshots and change retention for them doesn't really alter the problem. @jackye1995 is right that we're getting to a point where we can possibly change this in the future. Moving to a change-based API is one step toward it, in addition to the use case of being able to more easily migrate library versions. |
||
|
|
||
| For serialization details, see Appendix C. | ||
|
|
||
|
|
@@ -1006,7 +1039,7 @@ Table metadata is serialized as a JSON object according to the following table. | |
| |**`metadata-log`**|`JSON list of objects: [`<br /> `{`<br /> `"metadata-file": ,`<br /> `"timestamp-ms": `<br /> `},`<br /> `...`<br />`]`|`[ {`<br /> `"metadata-file": "s3://bucket/.../v1.json",`<br /> `"timestamp-ms": 1515100...`<br />`} ]` | | ||
| |**`sort-orders`**|`JSON sort orders (list of sort field object)`|`See above`| | ||
| |**`default-sort-order-id`**|`JSON int`|`0`| | ||
|
|
||
| |**`refs`**|`JSON map with string key and object value:`<br />`{`<br /> `"<name>": {`<br /> `"snapshot-id": <id>,`<br /> `"type": <type>,`<br /> `"max-ref-age-ms": <long>,`<br /> `...`<br /> `}`<br /> `...`<br />`}`|`{`<br /> `"test": {`<br /> `"snapshot-id": 123456789000,`<br /> `"type": "tag",`<br /> `"max-ref-age-ms": 10000000`<br /> `}`<br />`}`| | ||
|
|
||
| ### Name Mapping Serialization | ||
|
|
||
|
|
||
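To make the `refs` serialization in the diff above concrete, here is a minimal Python sketch of how a reader might load the map from parsed table metadata JSON. `SnapshotRef` and `parse_refs` are hypothetical names, not part of any Iceberg library. Rejecting branch-only retention fields on tags, and synthesizing `main` only when `current-snapshot-id` is present, are assumptions of this sketch rather than requirements stated in the diff.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Fields the diff describes as "for `branch` type only".
BRANCH_ONLY_FIELDS = ("min-snapshots-to-keep", "max-snapshot-age-ms")


@dataclass(frozen=True)
class SnapshotRef:
    """Hypothetical in-memory form of one snapshot reference object."""
    snapshot_id: int
    type: str  # "tag" or "branch"
    min_snapshots_to_keep: Optional[int] = None
    max_snapshot_age_ms: Optional[int] = None
    max_ref_age_ms: Optional[int] = None


def parse_refs(metadata: dict) -> Dict[str, SnapshotRef]:
    """Build a name -> SnapshotRef map from a parsed table metadata dict."""
    refs: Dict[str, SnapshotRef] = {}
    for name, obj in (metadata.get("refs") or {}).items():
        ref_type = obj["type"]
        if ref_type not in ("tag", "branch"):
            raise ValueError(f"unknown reference type for {name!r}: {ref_type!r}")
        # Assumption: treat branch-only fields on a tag as an error.
        if ref_type == "tag" and any(f in obj for f in BRANCH_ONLY_FIELDS):
            raise ValueError(f"tag {name!r} sets a branch-only retention field")
        refs[name] = SnapshotRef(
            snapshot_id=obj["snapshot-id"],
            type=ref_type,
            min_snapshots_to_keep=obj.get("min-snapshots-to-keep"),
            max_snapshot_age_ms=obj.get("max-snapshot-age-ms"),
            max_ref_age_ms=obj.get("max-ref-age-ms"),
        )
    # The diff says a `main` branch always points at `current-snapshot-id`,
    # even when the `refs` map is null.
    current = metadata.get("current-snapshot-id")
    if current is not None:
        main = refs.setdefault("main", SnapshotRef(current, "branch"))
        if main.snapshot_id != current:
            raise ValueError("main branch must reference current-snapshot-id")
    return refs
```

Calling `parse_refs` on metadata that contains only the `test` tag from the example row would return both that tag and an implicit `main` branch.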
As discussed on the google doc I would be in favour of splitting this into two structures. From the google doc:
To me this is coupling the concept of snapshot expiry to the branch/tag definition when we know we probably don't want to long term. Which could make it hard to make changes in the future. Maybe I am thinking about it too hard but to me there are two concepts: the list of branches/tags and the expiration policies. Having these as two separate data structures allows for both to grow independently in my mind: named policies, more complex policies etc as well as more flexibility in how the list of existing references is maintained in the presence of multi-table branching or catalog owned branching etc.
I'd be fine either way. I think that the advantage here is that it is simple to express. Having a list of expiration policies and applying them selectively to tags or branches, possibly by identifying the policy by ID seems a bit overkill to me. I think you'd probably have some default branch/tag retention, retention for branch versions, and that's it. Even customizing that per branch seems a little speculative to me. I think people will likely just set expiration globally and forget about it.
I think I mentioned this a bit in the design doc's reply: we can always add a reference pointer when there is a need to have a separate retention policy construct, so it's fully backwards compatible. For example:
But for now having a separate retention policy object feels a bit overkill.
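For readers following the retention discussion, the five expiration steps in the Snapshot Retention Policy section of the diff can be sketched in Python as below. This is a rough illustration under stated assumptions, not the Iceberg implementation: `Snapshot`, `Ref`, and `snapshots_to_retain` are hypothetical names, a snapshot's age is assumed to be `now_ms - timestamp_ms`, references to unknown snapshots are skipped, and the `default_*` arguments stand in for the `history.expire.*` table properties.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int


@dataclass
class Ref:
    snapshot_id: int
    type: str  # "tag" or "branch"
    min_snapshots_to_keep: Optional[int] = None
    max_snapshot_age_ms: Optional[int] = None
    max_ref_age_ms: Optional[int] = None


def snapshots_to_retain(
    snapshots: Dict[int, Snapshot],
    refs: Dict[str, Ref],
    now_ms: int,
    default_min_keep: int,
    default_max_snapshot_age_ms: int,
    default_max_ref_age_ms: int,
) -> Set[int]:
    retained: Set[int] = set()  # Step 1: start with an empty set.

    # Step 2: drop refs (other than main) whose snapshot is too old.
    live: Dict[str, Ref] = {}
    for name, ref in refs.items():
        snap = snapshots.get(ref.snapshot_id)
        if snap is None:
            continue  # Assumption: ignore refs to unknown snapshots.
        max_ref_age = (ref.max_ref_age_ms if ref.max_ref_age_ms is not None
                       else default_max_ref_age_ms)
        if name != "main" and now_ms - snap.timestamp_ms > max_ref_age:
            continue
        live[name] = ref

    for ref in live.values():
        retained.add(ref.snapshot_id)  # Step 3: keep each referenced snapshot.
        if ref.type != "branch":
            continue
        min_keep = (ref.min_snapshots_to_keep if ref.min_snapshots_to_keep is not None
                    else default_min_keep)
        max_age = (ref.max_snapshot_age_ms if ref.max_snapshot_age_ms is not None
                   else default_max_snapshot_age_ms)
        # Step 4: walk ancestors; stop at the first one that is both too old
        # and beyond the first `min_keep` snapshots (the head counts as one).
        kept = 1
        parent_id = snapshots[ref.snapshot_id].parent_id
        while parent_id is not None and parent_id in snapshots:
            snap = snapshots[parent_id]
            too_old = now_ms - snap.timestamp_ms > max_age
            if too_old and kept >= min_keep:
                break
            retained.add(snap.snapshot_id)
            kept += 1
            parent_id = snap.parent_id

    # Step 5: the caller expires every snapshot not in `retained`.
    return retained
```

Because per-reference values fall back to the defaults only when unset, the sketch also reflects the point made in this thread that most tables would likely set expiration globally and override it per branch or tag only when needed.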