Skip to content

Conversation

@emkornfield
Copy link
Contributor

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Nov 3, 2023
emkornfield and others added 2 commits November 6, 2023 13:37
Co-authored-by: Eduard Tudenhoefner <[email protected]>
Co-authored-by: Eduard Tudenhoefner <[email protected]>
@emkornfield emkornfield requested a review from nastra November 6, 2023 21:38
@emkornfield emkornfield changed the title Clarify time travel implementation in Iceberg Spec: Clarify time travel implementation in Iceberg Nov 7, 2023

### Point in Time Reads (Time Travel)

Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is exactly meant by two histories might not be consistent for determining a table state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried to clarify. please let me know if this reads better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle ping @aokolnychyi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discrepancies can be caused by a variety of table operations (e.g. updating the current-snapshot-id of the table).

might be better to further clarify the discrepancy. this wasn't super clear to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more lanugage here, please let me know if it addresses the concerns.

@emkornfield
Copy link
Contributor Author

@aokolnychyi did my changes address your feedback properly?

@emkornfield
Copy link
Contributor Author

@aokolnychyi happy new year! Would you mind taking a second look?

@emkornfield
Copy link
Contributor Author

@Fokko @aokolnychyi would you mind taking a look?

@emkornfield
Copy link
Contributor Author

@Fokko @aokolnychyi just wanted to ping again to see if you have bandwidth to take another look?

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one minor comment, but I think it would be good to get this in

Co-authored-by: Fokko Driesprong <[email protected]>
format/spec.md Outdated
Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).

When processing point in time queries the Iceberg community has chosen to use "snapshot-log" metadata to lookup the table state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe For point in time queries, "snapshot-log" metadata are used to lookup the table state at the given point of time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we would say "the iceberg community has chosen." If we write it into the spec that's redundant. Also this isn't an implementation note though if we are telling folks to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Russell, but I would not block this clarification from going in. @RussellSpitzer, is this a blocker for you? If so could you note it on the vote thread? I'm going to vote +1, but we should wait until you're satisfied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rephrase when I move it under snapshots.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for removing "the Iceberg community has chosen to", the spec should not involve the community, we can just describe what should be done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield : my comment above is not about the retrieval of the JSON metadata itself (which must be supported), but about the (optional) snapshot-log field in the JSON payload returned by a REST Catalog server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec is independent of the REST catalog protocol. The protocol covers how to exchange table information covered by this spec, but this spec covers how to track that information and, in this case, recommendations for how that information is used. In the context of your suggestion, this assumes that "when the catalog makes the snapshot history available in the metadata JSON" is all the time. It is always true because this spec defines the metadata JSON.

I think it is valuable to say that engines are encouraged to use the information from snapshot-log for time travel by timestamp so that the results match what a query would have seen at that time. We made that choice for Spark because we think that is what users expect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snapshot-log is optional in both v1 and v2 spec: https://iceberg.apache.org/spec/#table-metadata-fields

Given that this PR suggests how engines should process that information so that users can expect similar behaviour from different engines, it would be nice to provide a suggestion of how to deal with missing snapshot-log in a common way too. In other words, how to produce a user-friendly message as opposed to some internal and engine-specific error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sentence mentioning an informative message being raised to the user about missing metadata. I also tried to rephrased to remove the the iceberg community. If further word-smithing is desired, suggestions are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @emkornfield - current text LGTM 👍


Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.

## Appendix F: Implementation Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit hesitant on adding an "implementation notes" section to the spec.

I am wondering if time travel clarification should be added to the snapshots section of the spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move it there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. I don't think that the spec mandates this time travel behavior, but it is what implementations should do to be consistent with one another. Seems like "Implementation notes" is a good place for this kind of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu @rdblue @ajantha-bhat not sure if either of you have more strongly held preference on this, happy to do it either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should call it Implementation Suggestions, or Suggested Behaviors?

I'm fine with it being a separate section

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see Ryan's point. I can withdraw my argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why can't we mandate this in spec? because it is optional field? We can use "if exists" while mandating in the spec?

Also, we have implementation suggetions like "File System Tables" and "Metastore tables" in the spec. So, can we have a "Time Travel" section for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajantha-bhat I think the two sections you described above are required for writers/readers for proper functionality. As per above this is not required for correctness but allows for consistency on implementation.

Copy link
Member

@ajantha-bhat ajantha-bhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts/concerns in using the optional field ("snapshot-log") as source of truth for timetravel instead of the available "snapshots" field.

@rdblue rdblue merged commit b769391 into apache:main Jul 23, 2024
@rdblue
Copy link
Contributor

rdblue commented Jul 23, 2024

Looks like we have reached consensus, so I'll go ahead and merge this. Thanks for the addition, @emkornfield!

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
Co-authored-by: Eduard Tudenhoefner <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Co-authored-by: Eduard Tudenhoefner <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.