-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core, API: Add getting refs and snapshot by ref to the Table API #4428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -305,4 +305,25 @@ default AppendFiles newFastAppend() { | |
|
|
||
| /** Returns a {@link LocationProvider} to provide locations for new data files. */ | ||
| LocationProvider locationProvider(); | ||
|
|
||
| /** | ||
| * Returns the current refs for the table | ||
| * | ||
| * @return the current refs for the table | ||
| */ | ||
| Map<String, SnapshotRef> refs(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this also have a default that creates a map of "main" to the current snapshot ID?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's one way to go, I've been thinking it would actually be simpler if when parsing the metadata, if a main branch does not exist, set it to the current snapshot. That way a lot of the API logic can rely on this assumption that main exists and we don't need a lot of special code for the case where the new Iceberg library is reading an older metadata file where refs (and thus main) may not exist. This PR would be blocked on a PR for doing that so I will raise that.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, that sounds reasonable as well. |
||
|
|
||
| /** | ||
| * Returns the snapshot referenced by the given name or null if no such reference exists. | ||
| * | ||
| * @return the snapshot which is referenced by the given name or null if no such reference exists. | ||
| */ | ||
| default Snapshot snapshot(String name) { | ||
amogh-jahagirdar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| SnapshotRef ref = refs().get(name); | ||
| if (ref != null) { | ||
| return snapshot(ref.snapshotId()); | ||
| } | ||
|
|
||
| return null; | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to expose
SnapshotRefin theTableAPI? Why not return Snapshot for ref names?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it made sense to expose Refs in the Table API because refs are maintained per table and any operation which leverages the table API can easily access them.
I think refs() is helpful mostly as a convenience method to list all the refs in a table (similar being able to list the table properties, or the schemas/partition specs etc).
For the cases where a user of the table API knows what ref they are looking for then the ref(ref) becomes helpful. If we don't return Snapshot for a given ref name, then a caller has to do
table.snapshot(table.refs(name).snashotId())and they have to also take care of the null check in case a ref with name does not exist.
So my conclusion is the following:
1.) Expose the SnapshotRef in the Table API should be fine, because refs are maintained at the table level and an API expose to them either as a collection or a convenience method to look them up by name fits the model, and could be used.
2.) It will be common to want the snapshot for a given ref, and it also makes sense to have an API for returning a Snapshot for a given ref name.
Alternative:
The tradeoff of the above is the API is less minimal. For the purpose of the table scan we will ultimately need the snapshot for a given ref. So if we want to start minimal, what we can do is add the
Snapshot snapshot(String ref)signature and only when the refs(), ref(String name) are truly needed we add those.@rdblue @jackye1995 @singhpk234 Let me know your thoughts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I'd probably add the
snapshot(String name)method and put off adding methods toTableuntil we know that we will definitely need them.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, one thing I forgot was for SerializableTable, we'll need to pass the refs through https://github.com/apache/iceberg/pull/4428/files#diff-46dafb425240806166ccc8f27a6c301781c5082bf1a187291cc92c2a8f17a588R85 so we'll need Map<String, SnapshotRef> refs() to implement the contract of snapshot(String name) in SerializableTable.
So then we need both refs() and snapshot(String name).
I think can leave off the ref(String name) for now, until we know that convenience method is really useful; I anticipate it will become useful mostly just to avoid another level of indirection when looking up from the map, but we can aim for just keeping the API changes minimal for now.