Spec: Clarify time travel implementation in Iceberg #8982
Changes from all commits
7090204
57bf503
e1501fe
ef571ed
6a631c9
71703e7
fedfdc6
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1370,3 +1370,14 @@ Writing v2 metadata: | |
| * `sort_columns` was removed | ||
|
|
||
| Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements. | ||
|
|
||
| ## Appendix F: Implementation Notes | ||
|
|
||
| This section covers topics that are not required by the specification but are recommended for systems implementing the Iceberg specification, to help maintain a uniform experience. | ||
|
|
||
| ### Point in Time Reads (Time Travel) | ||
|
|
||
| Iceberg supports two types of histories for tables: a history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and a [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories | ||
|
Contributor
What is exactly meant by
Contributor
Author
Tried to clarify. Please let me know if this reads better.
Contributor
Gentle ping @aokolnychyi
Contributor
might be better to further clarify the discrepancy. this wasn't super clear to me.
Contributor
Author
I added more language here, please let me know if it addresses the concerns.
||
| might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` can be used to set the snapshot of a table to any arbitrary snapshot, which might have a lineage derived from a table branch or no lineage at all). | ||
|
|
||
| When processing point in time queries, implementations should use "snapshot-log" metadata to look up the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example, a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata. | ||
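The lookup recommended in the proposed text can be sketched roughly as follows. This is a minimal illustration, not the behavior of any particular Iceberg implementation: it assumes the table metadata has already been parsed into a plain dictionary, `snapshot_id_as_of` is a hypothetical helper name, and treating a log entry made exactly at the requested timestamp as visible is an assumption the proposed text does not settle.

```python
def snapshot_id_as_of(table_metadata: dict, timestamp_ms: int) -> int:
    """Return the snapshot ID that was current at timestamp_ms, using "snapshot-log".

    Hypothetical helper: table_metadata is assumed to be the parsed table
    metadata JSON, whose optional "snapshot-log" field is a list of entries
    with "timestamp-ms" and "snapshot-id" keys.
    """
    log = table_metadata.get("snapshot-log")
    if not log:
        # "snapshot-log" is optional, so report the missing metadata clearly.
        raise ValueError('Cannot time travel: table metadata has no "snapshot-log" entries')

    current = None
    for entry in log:
        # Assumption: an entry made exactly at the requested time is visible.
        if entry["timestamp-ms"] <= timestamp_ms:
            if current is None or entry["timestamp-ms"] >= current["timestamp-ms"]:
                current = entry

    if current is None:
        raise ValueError(
            f"Cannot time travel: no snapshot-log entry at or before {timestamp_ms} ms"
        )
    return current["snapshot-id"]


# Made-up example metadata: the table was rolled back to snapshot 11 at t=3000.
metadata = {
    "snapshot-log": [
        {"timestamp-ms": 1000, "snapshot-id": 11},
        {"timestamp-ms": 2000, "snapshot-id": 22},
        {"timestamp-ms": 3000, "snapshot-id": 11},
    ]
}
print(snapshot_id_as_of(metadata, 2500))  # 22
```

Because the sketch reads only "snapshot-log" and never walks parent-child lineage, a rollback or a change to `current-snapshot-id` is reflected as the table state readers actually saw at that time.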
I am a bit hesitant on adding an "implementation notes" section to the spec.
I am wondering if time travel clarification should be added to the `snapshots` section of the spec?
I can move it there.
I like this approach. I don't think that the spec mandates this time travel behavior, but it is what implementations should do to be consistent with one another. Seems like "Implementation notes" is a good place for this kind of thing.
@stevenzwu @rdblue @ajantha-bhat not sure if either of you have more strongly held preference on this, happy to do it either way.
Maybe we should call it Implementation Suggestions, or Suggested Behaviors?
I'm fine with it being a separate section
I can see Ryan's point. I can withdraw my argument.
Just wondering, why can't we mandate this in the spec? Because it is an optional field? We can use "if exists" while mandating it in the spec?
Also, we have implementation suggestions like "File System Tables" and "Metastore tables" in the spec. So, can we have a "Time Travel" section for this?
@ajantha-bhat I think the two sections you described above are required for writers/readers for proper functionality. As per above this is not required for correctness but allows for consistency on implementation.