Skip to content
Merged
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions format/spec.md
Original file line number Diff line number Diff line change
Expand Up @@ -1370,3 +1370,16 @@ Writing v2 metadata:
* `sort_columns` was removed

Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.

## Appendix F: Implementation Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit hesitant on adding an "implementation notes" section to the spec.

I am wondering if time travel clarification should be added to the snapshots section of the spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move it there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. I don't think that the spec mandates this time travel behavior, but it is what implementations should do to be consistent with one another. Seems like "Implementation notes" is a good place for this kind of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu @rdblue @ajantha-bhat not sure if either of you have more strongly held preference on this, happy to do it either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should call it Implementation Suggestions, or Suggested Behaviors?

I'm fine with it being a separate section

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see Ryan's point. I can withdraw my argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why can't we mandate this in spec? because it is optional field? We can use "if exists" while mandating in the spec?

Also, we have implementation suggetions like "File System Tables" and "Metastore tables" in the spec. So, can we have a "Time Travel" section for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajantha-bhat I think the two sections you described above are required for writers/readers for proper functionality. As per above this is not required for correctness but allows for consistency on implementation.


This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification
to help maintain a uniform experience.

### Point in Time Reads (Time Travel)

Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is exactly meant by two histories might not be consistent for determining a table state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried to clarify. please let me know if this reads better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle ping @aokolnychyi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discrepancies can be caused by a variety of table operations (e.g. updating the current-snapshot-id of the table).

might be better to further clarify the discrepancy. this wasn't super clear to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more lanugage here, please let me know if it addresses the concerns.

might indicate different snapshot IDs for a specific timestamp. The discrepencies can be caused by a variety of table operations (e.g. branch-merge table workflows or forcing the current state of table to specific snapshot ID).

When processing point in time queries the Iceberg community has chosen to use "snapshot-log" metadata to lookup the table state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe For point in time queries, "snapshot-log" metadata are used to lookup the table state at the given point of time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we would say "the iceberg community has chosen." If we write it into the spec that's redundant. Also this isn't an implementation note though if we are telling folks to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Russell, but I would not block this clarification from going in. @RussellSpitzer, is this a blocker for you? If so could you note it on the vote thread? I'm going to vote +1, but we should wait until you're satisfied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rephrase when I move it under snapshots.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for removing "the Iceberg community has chosen to", the spec should not involve the community, we can just describe what should be done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield : my comment above is not about the retrieval of the JSON metadata itself (which must be supported), but about the (optional) snapshot-log field in the JSON payload returned by a REST Catalog server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec is independent of the REST catalog protocol. The protocol covers how to exchange table information covered by this spec, but this spec covers how to track that information and, in this case, recommendations for how that information is used. In the context of your suggestion, this assumes that "when the catalog makes the snapshot history available in the metadata JSON" is all the time. It is always true because this spec defines the metadata JSON.

I think it is valuable to say that engines are encouraged to use the information from snapshot-log for time travel by timestamp so that the results match what a query would have seen at that time. We made that choice for Spark because we think that is what users expect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snapshot-log is optional in both v1 and v2 spec: https://iceberg.apache.org/spec/#table-metadata-fields

Given that this PR suggests how engines should process that information so that users can expect similar behaviour from different engines, it would be nice to provide a suggestion of how to deal with missing snapshot-log in a common way too. In other words, how to produce a user-friendly message as opposed to some internal and engine-specific error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sentence mentioning an informative message being raised to the user about missing metadata. I also tried to rephrased to remove the the iceberg community. If further word-smithing is desired, suggestions are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @emkornfield - current text LGTM 👍

at the given point in time. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table.