Merged
1 change: 1 addition & 0 deletions site/docs/configuration.md
@@ -70,6 +70,7 @@ Iceberg tables support table properties to configure table behavior, like the de
| commit.manifest-merge.enabled | true | Controls whether to automatically merge manifests on writes |
| history.expire.max-snapshot-age-ms | 432000000 (5 days) | Default max age of snapshots to keep while expiring snapshots |
| history.expire.min-snapshots-to-keep | 1 | Default min number of snapshots to keep while expiring snapshots |
| history.expire.max-ref-age-ms | `Long.MAX_VALUE` (forever) | For snapshot references except the `main` branch, default max age of snapshot references to keep while expiring snapshots. The `main` branch never expires. |

### Compatibility flags

37 changes: 35 additions & 2 deletions site/docs/spec.md
@@ -566,6 +566,38 @@ Notes:
1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.

#### Snapshot Reference

Iceberg tables keep track of branches and tags using snapshot references.
Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](#commit-conflict-resolution-and-retry) procedures.

The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](#snapshot-retention-policy).

Contributor

As discussed on the google doc I would be in favour of splitting this into two structures. From the google doc:

To me this is coupling the concept of snapshot expiry to the branch/tag definition when we know we probably don't want to long term. Which could make it hard to make changes in the future. Maybe I am thinking about it too hard but to me there are two concepts: the list of branches/tags and the expiration policies. Having these as two separate data structures allows for both to grow independently in my mind: named policies, more complex policies etc as well as more flexibility in how the list of existing references is maintained in the presence of multi-table branching or catalog owned branching etc.

Contributor

I'd be fine either way. I think that the advantage here is that it is simple to express. Having a list of expiration policies and applying them selectively to tags or branches, possibly by identifying the policy by ID seems a bit overkill to me. I think you'd probably have some default branch/tag retention, retention for branch versions, and that's it. Even customizing that per branch seems a little speculative to me. I think people will likely just set expiration globally and forget about it.

Contributor Author

I think I mentioned this a bit in my reply on the design doc: we can always add a reference pointer when there is a need for a separate retention policy construct, so it's fully backwards compatible. For example:

"branchA": {
  "max-ref-age-ms": 10000
},
"branchB": {
  "retention-policy-id": 2
}

But for now having a separated retention policy object feels a bit overkill.

| v1 | v2 | Field name | Type | Description |
| ---------- | ---------- | ---------------------------- | --------- | ----------- |
| _required_ | _required_ | **`snapshot-id`** | `long` | A reference's snapshot ID. The tagged snapshot or latest snapshot of a branch. |
| _required_ | _required_ | **`type`** | `string` | Type of the reference, `tag` or `branch` |
| _optional_ | _optional_ | **`min-snapshots-to-keep`** | `int` | For `branch` type only, a positive number for the minimum number of snapshots to keep in a branch while expiring snapshots. Defaults to table property `history.expire.min-snapshots-to-keep`. |
| _optional_ | _optional_ | **`max-snapshot-age-ms`** | `long` | For `branch` type only, a positive number for the max age of snapshots to keep when expiring, including the latest snapshot. Defaults to table property `history.expire.max-snapshot-age-ms`. |
| _optional_ | _optional_ | **`max-ref-age-ms`** | `long` | For snapshot references except the `main` branch, a positive number for the max age of the snapshot reference to keep while expiring snapshots. Defaults to table property `history.expire.max-ref-age-ms`. The `main` branch never expires. |

Valid snapshot references are stored as the values of the `refs` map in table metadata. For serialization, see Appendix C.
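
As a rough illustration of the table above, the sketch below models a snapshot reference and resolves its retention settings against table properties. The names `SnapshotRef`, `TABLE_DEFAULTS`, and `effective_retention` are hypothetical and not part of Iceberg. Rejecting branch-only fields on a tag is an assumption about how "For `branch` type only" might be enforced, not something the spec text states.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

JAVA_LONG_MAX = 2**63 - 1  # Long.MAX_VALUE, the documented default for max-ref-age-ms

# Defaults taken from the configuration table in this change.
TABLE_DEFAULTS = {
    "history.expire.min-snapshots-to-keep": 1,
    "history.expire.max-snapshot-age-ms": 432_000_000,  # 5 days
    "history.expire.max-ref-age-ms": JAVA_LONG_MAX,
}


@dataclass(frozen=True)
class SnapshotRef:
    """Hypothetical in-memory form of a snapshot reference object."""
    snapshot_id: int
    type: str  # "tag" or "branch"
    min_snapshots_to_keep: Optional[int] = None
    max_snapshot_age_ms: Optional[int] = None
    max_ref_age_ms: Optional[int] = None

    def __post_init__(self) -> None:
        if self.type not in ("tag", "branch"):
            raise ValueError(f"unknown reference type: {self.type!r}")
        branch_only = (self.min_snapshots_to_keep, self.max_snapshot_age_ms)
        # Assumption: treat branch-only fields on a tag as an error.
        if self.type == "tag" and any(v is not None for v in branch_only):
            raise ValueError("min-snapshots-to-keep and max-snapshot-age-ms apply to branches only")
        for value in (*branch_only, self.max_ref_age_ms):
            if value is not None and value <= 0:
                raise ValueError("retention values must be positive")


def effective_retention(ref: SnapshotRef, properties: Mapping[str, str]) -> Dict[str, int]:
    """Fill unset reference fields from table properties, then from the defaults."""
    def resolve(value: Optional[int], key: str) -> int:
        if value is not None:
            return value
        # Table properties are a string-to-string map, so convert to int.
        return int(properties.get(key, TABLE_DEFAULTS[key]))

    result = {"max-ref-age-ms": resolve(ref.max_ref_age_ms, "history.expire.max-ref-age-ms")}
    if ref.type == "branch":
        result["min-snapshots-to-keep"] = resolve(
            ref.min_snapshots_to_keep, "history.expire.min-snapshots-to-keep")
        result["max-snapshot-age-ms"] = resolve(
            ref.max_snapshot_age_ms, "history.expire.max-snapshot-age-ms")
    return result
```

A value set on the reference wins over the table property, and the table property wins over the documented default.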

#### Snapshot Retention Policy

Table snapshots expire and are removed from metadata to allow removed or replaced data files to be physically deleted.
The snapshot expiration procedure removes snapshots from table metadata and applies the table's retention policy.
Retention policy can be configured both globally and on snapshot reference through properties `min-snapshots-to-keep`, `max-snapshot-age-ms` and `max-ref-age-ms`.

When expiring snapshots, retention policies in table and snapshot references are evaluated in the following way:

1. Start with an empty set of snapshots to retain
2. Remove any refs (other than main) where the referenced snapshot is older than `max-ref-age-ms`
3. For each branch and tag, add the referenced snapshot to the retained set
4. For each branch, add its ancestors to the retained set until:
    1. The snapshot is older than `max-snapshot-age-ms`, AND
    2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
5. Expire any snapshot not in the set of snapshots to retain.
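
The steps above can be sketched roughly as follows. This is a minimal illustration, not Iceberg's implementation: `Snapshot`, `Ref`, and `expire_snapshots` are hypothetical names, retention values are assumed to be already resolved per reference, every referenced snapshot is assumed to exist in the snapshot map, and "older than" is read as a strict comparison of age against the limit.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

JAVA_LONG_MAX = 2**63 - 1  # Long.MAX_VALUE


@dataclass(frozen=True)
class Snapshot:
    timestamp_ms: int
    parent_id: Optional[int] = None


@dataclass(frozen=True)
class Ref:
    snapshot_id: int
    type: str  # "branch" or "tag"
    min_snapshots_to_keep: int = 1
    max_snapshot_age_ms: int = 432_000_000
    max_ref_age_ms: int = JAVA_LONG_MAX


def expire_snapshots(
    snapshots: Dict[int, Snapshot], refs: Dict[str, Ref], now_ms: int
) -> Tuple[Dict[str, Ref], Set[int], Set[int]]:
    """Return (surviving refs, retained snapshot IDs, expired snapshot IDs)."""
    # Step 2: drop refs (other than main) whose referenced snapshot is too old.
    live_refs = {
        name: ref
        for name, ref in refs.items()
        if name == "main"
        or now_ms - snapshots[ref.snapshot_id].timestamp_ms <= ref.max_ref_age_ms
    }

    retained: Set[int] = set()  # Step 1: start with an empty set
    for ref in live_refs.values():
        retained.add(ref.snapshot_id)  # Step 3: keep every referenced snapshot
        if ref.type != "branch":
            continue
        # Step 4: walk the branch's ancestors.
        kept_in_branch = 1  # the referenced snapshot counts toward the minimum
        current = snapshots[ref.snapshot_id].parent_id
        while current is not None and current in snapshots:
            too_old = now_ms - snapshots[current].timestamp_ms > ref.max_snapshot_age_ms
            beyond_minimum = kept_in_branch >= ref.min_snapshots_to_keep
            if too_old and beyond_minimum:
                break  # both stop conditions hold
            retained.add(current)
            kept_in_branch += 1
            current = snapshots[current].parent_id

    # Step 5: everything else expires.
    expired = set(snapshots) - retained
    return live_refs, retained, expired
```

Note that the walk stops only when both conditions hold, so a branch always keeps at least `min-snapshots-to-keep` snapshots regardless of their age.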

### Table Metadata

@@ -593,12 +625,13 @@ Table metadata consists of the following fields:
| _optional_ | _required_ | **`default-spec-id`**| ID of the "current" spec that writers should use by default. |
| _optional_ | _required_ | **`last-partition-id`**| An integer; the highest assigned partition field ID across all partition specs for the table. This is used to ensure partition fields are always assigned an unused ID when evolving specs. |
| _optional_ | _optional_ | **`properties`**| A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, `commit.retry.num-retries` is used to control the number of commit retries. |
| _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot. |
| _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot; must be the same as the current ID of the `main` branch in `refs`. |
| _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected. |
| _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the table. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed. |
| _optional_ | _optional_ | **`metadata-log`**| A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
| _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. |
| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |

Contributor

My concern raised on the google doc was that the act of re-writing metadata.json and doing a full transaction just to add or change a branch/tag feels strange. Now that current-branch is gone I am a bit less concerned. The only time where there is a transaction just to modify the refs field is when a branch or tag is created, is that right?

The optimistic retry means that at transaction commit time things are unlikely to conflict or cause issues. I think it just feels counterintuitive to me to do a re-write of the table metadata just to, e.g., change the branch or add a tag. I also have concerns about extensibility in the future, for example extending to multiple tables or implementing a `git log`-like feature.

One potential solution is to give the catalogs control over the implementation. For example, the HMS catalog may store the proposed branch-related data structures in the table definition in Hive rather than in the metadata.json. This allows individual catalogs more latitude in how they deal with branching, and the ability to evolve that strategy over time. It does have the downside that migrating between catalogs could be more troublesome.

Contributor

> The only time where there is a transaction just to modify the refs field is when a branch or tag is created, is that right?

Yes.

> feels counterintuitive to me to do a re-write of the table metadata just to eg change the branch or add a tag . . . One potential solution is to give the catalogs control over the implementation.

I agree, this is a bit awkward. But it fits with the current model of keeping metadata.json as the source of truth for a table. I think we can evolve catalogs to do fancier things here -- for one, not sending all of the snapshots back to the client. But as far as the spec is concerned, how would we express that?

Contributor Author (@jackye1995, Nov 30, 2021)

Maybe one potential middle ground we can settle here is to first add the tagging feature without branching. That is a feature that many users would like to have for use cases like time-based snapshot expiration, and it makes sense to have that as a part of the Iceberg spec. For the branching feature, it was added to the design more for completeness, but so far (at least to me) there is no feature request for that yet. We can add it when the feature request comes and then discuss the best way to go. What do you think? @rdblue @rymurr

Contributor

I don't think we should modify the proposal here. We know that lots of snapshots can be a problem, but that's how Iceberg manages snapshots. I don't think we need to hold off on building a feature like this for fairly rare cases. We can add branches and handle the metadata issues elsewhere.

One solution I've been thinking about is allowing table implementations to request a specific time range or updates since some time in the REST catalog. That could handle this problem nicely.

Contributor Author

Makes sense. In that case, do you have any additional concerns related to the spec change?

Contributor

I still feel uncomfortable about a transaction to add a tag but I don't see any easy way out of it. I think it would be good to have a discussion about metadata.json as the source of truth (for everything) in the longer term as I think that is becoming less feasible. My comment on the call today about notification settings living in metadata is related.

I guess my only question is: if we agree that it's awkward and we agree that catalogs have more of a role to play in the future, then how do we move on from this proposal? It's hard to back this out of the spec or evolve it once it's in. I know this is a hard question to reason about and I don't want to hold this useful feature up. But it would be good to at least think about it given the above discussion.

Contributor Author

Yes, I think the work in the TableMetadata builder and REST catalog is an attempt to have an alternative way of dealing with table metadata updates, as a diff rather than rewriting the entire table metadata file. That could open the door to other implementations of the TableMetadata spec that are backed by services and do not require a full metadata rewrite. That is a development I would be happy to see, and with that we will be able to have fast commits to the metadata for small changes like tagging.

Contributor

Yeah, I think we agree that it is awkward. But it's a separate point. We already allow branching to some degree, or at least a collection of snapshots that isn't necessarily linear. Being able to label some of those snapshots and change retention for them doesn't really alter the problem.

@jackye1995 is right that we're getting to a point where we can possibly change this in the future. Moving to a change-based API is one step toward it, in addition to the use case of being able to more easily migrate library versions.


For serialization details, see Appendix C.
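
To illustrate the relationship between `current-snapshot-id` and the `main` branch described in the table above, here is a minimal sketch that operates on table metadata already parsed into a Python dict. The function name `resolve_refs` is hypothetical. Leaving `main` out when the table has no current snapshot is an assumption, since the text does not cover that case.

```python
from typing import Any, Dict


def resolve_refs(metadata: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the refs map with the implied `main` branch made explicit."""
    refs = dict(metadata.get("refs") or {})  # `refs` may be missing or null
    current = metadata.get("current-snapshot-id")
    main = refs.get("main")

    if main is None:
        # There is always a `main` branch pointing to current-snapshot-id,
        # even if the refs map is null.
        if current is not None:  # assumption: no snapshot means no implied main
            refs["main"] = {"snapshot-id": current, "type": "branch"}
    elif main["snapshot-id"] != current:
        # current-snapshot-id must match the snapshot ID of `main` in refs.
        raise ValueError(
            f"main points to {main['snapshot-id']} but current-snapshot-id is {current}"
        )
    return refs
```

Readers that resolve references this way can treat tables with and without an explicit `refs` map uniformly.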

@@ -1006,7 +1039,7 @@ Table metadata is serialized as a JSON object according to the following table.
|**`metadata-log`**|`JSON list of objects: [`<br />&nbsp;&nbsp;`{`<br />&nbsp;&nbsp;`"metadata-file": ,`<br />&nbsp;&nbsp;`"timestamp-ms": `<br />&nbsp;&nbsp;`},`<br />&nbsp;&nbsp;`...`<br />`]`|`[ {`<br />&nbsp;&nbsp;`"metadata-file": "s3://bucket/.../v1.json",`<br />&nbsp;&nbsp;`"timestamp-ms": 1515100...`<br />`} ]` |
|**`sort-orders`**|`JSON sort orders (list of sort field object)`|`See above`|
|**`default-sort-order-id`**|`JSON int`|`0`|
|**`refs`**|`JSON map with string key and object value:`<br />`{`<br />&nbsp;&nbsp;`"<name>": {`<br />&nbsp;&nbsp;`"snapshot-id": <id>,`<br />&nbsp;&nbsp;`"type": <type>,`<br />&nbsp;&nbsp;`"max-ref-age-ms": <long>,`<br />&nbsp;&nbsp;`...`<br />&nbsp;&nbsp;`}`<br />&nbsp;&nbsp;`...`<br />`}`|`{`<br />&nbsp;&nbsp;`"test": {`<br />&nbsp;&nbsp;`"snapshot-id": 123456789000,`<br />&nbsp;&nbsp;`"type": "tag",`<br />&nbsp;&nbsp;`"max-ref-age-ms": 10000000`<br />&nbsp;&nbsp;`}`<br />`}`|
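
As a quick check of the `refs` serialization shown in the table, the sketch below parses the example value with Python's standard `json` module and verifies the two required fields. The helper name `parse_refs` is hypothetical, and the validation shown is only what the field table implies, not a complete parser.

```python
import json
from typing import Any, Dict

# The example value from the serialization table above.
REFS_JSON = """
{
  "test": {
    "snapshot-id": 123456789000,
    "type": "tag",
    "max-ref-age-ms": 10000000
  }
}
"""


def parse_refs(text: str) -> Dict[str, Dict[str, Any]]:
    """Parse a serialized `refs` map and check the required fields."""
    refs = json.loads(text)
    for name, ref in refs.items():
        missing = {"snapshot-id", "type"}.difference(ref)
        if missing:
            raise ValueError(f"ref {name!r} is missing {sorted(missing)}")
        if ref["type"] not in ("tag", "branch"):
            raise ValueError(f"ref {name!r} has unknown type {ref['type']!r}")
    return refs


refs = parse_refs(REFS_JSON)
```

The map keys are the reference names, so `refs["test"]` holds the reference object for the `test` tag.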

### Name Mapping Serialization
