-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core, API: Add getting refs and snapshot by ref to the Table API #4428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rdblue @jackye1995 Let me know what you think! |
|
Investigating failing core tests |
c75d13c to
0f37b2a
Compare
| * @return a new scan based on this with the given branch | ||
| * @throws IllegalArgumentException if the branch cannot be found | ||
| */ | ||
| TableScan branch(String branch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the distinction between branch/tag really matter here or could we just have a TableScan ref(String ref) method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point. My thought was that from an API point of view people would be more familiar with the terms branch and tag, so it would be easier to have those methods. I also think if is using this, they know before hand if it's a branch or a tag, so it's not cumbersome.
But yeah currently, the logic is effectively the same which is to use the snapshot associated with the reference.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of just using TableScan ref(String ref) / TableScan reference(String ref) since usually people are familiar with the term reference when it comes to branches/tag, but let's see what others think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would useRef be better? Just to match useSnapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think there is a need to distinguish useBranch and useTag, because useBranch can be used in combiantion with asOfTime to perform time travel against a specific branch.
| * | ||
| * @return the ref with the given name or null if it does not exist. | ||
| */ | ||
| SnapshotRef ref(String ref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need this method, if it's always refs().get(ref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nvm we also have the method in TableMetadata, so probably better to have it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, my thinking was just to keep it the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's nice to provide the convenience method.
|
Can we separate this a bit to 3 parts:
|
|
Cool, I think for this PR I'll just add the refs() related APIS to Table. Will do the other PRs separately. |
0f37b2a to
cb24b7a
Compare
| } | ||
|
|
||
| @Override | ||
| public SnapshotRef ref(String ref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels a bit awkward for metadata table to have refs... but given the fact that we are just mapping this to methods of the source table() for all the other methods, this seems to be the only way to do it
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
|
@nastra any additional comments? If not I think this is a straight-forward change and will merge later. |
| * | ||
| * @return the current refs for the table | ||
| */ | ||
| Map<String, SnapshotRef> refs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to expose SnapshotRef in the Table API? Why not return Snapshot for ref names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it made sense to expose Refs in the Table API because refs are maintained per table and any operation which leverages the table API can easily access them.
I think refs() is helpful mostly as a convenience method to list all the refs in a table (similar being able to list the table properties, or the schemas/partition specs etc).
For the cases where a user of the table API knows what ref they are looking for then the ref(ref) becomes helpful. If we don't return Snapshot for a given ref name, then a caller has to do
table.snapshot(table.refs(name).snashotId())
and they have to also take care of the null check in case a ref with name does not exist.
So my conclusion is the following:
1.) Expose the SnapshotRef in the Table API should be fine, because refs are maintained at the table level and an API expose to them either as a collection or a convenience method to look them up by name fits the model, and could be used.
2.) It will be common to want the snapshot for a given ref, and it also makes sense to have an API for returning a Snapshot for a given ref name.
Alternative:
The tradeoff of the above is the API is less minimal. For the purpose of the table scan we will ultimately need the snapshot for a given ref. So if we want to start minimal, what we can do is add the
Snapshot snapshot(String ref) signature and only when the refs(), ref(String name) are truly needed we add those.
@rdblue @jackye1995 @singhpk234 Let me know your thoughts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I'd probably add the snapshot(String name) method and put off adding methods to Table until we know that we will definitely need them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, one thing I forgot was for SerializableTable, we'll need to pass the refs through https://github.com/apache/iceberg/pull/4428/files#diff-46dafb425240806166ccc8f27a6c301781c5082bf1a187291cc92c2a8f17a588R85 so we'll need Map<String, SnapshotRef> refs() to implement the contract of snapshot(String name) in SerializableTable.
So then we need both refs() and snapshot(String name).
I think can leave off the ref(String name) for now, until we know that convenience method is really useful; I anticipate it will become useful mostly just to avoid another level of indirection when looking up from the map, but we can aim for just keeping the API changes minimal for now.
1a16032 to
413e75a
Compare
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| * | ||
| * @return the current refs for the table | ||
| */ | ||
| Map<String, SnapshotRef> refs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also have a default that creates a map of "main" to the current snapshot ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one way to go, I've been thinking it would actually be simpler if when parsing the metadata, if a main branch does not exist, set it to the current snapshot. That way a lot of the API logic can rely on this assumption that main exists and we don't need a lot of special code for the case where the new Iceberg library is reading an older metadata file where refs (and thus main) may not exist.
This PR would be blocked on a PR for doing that so I will raise that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds reasonable as well.
|
@amogh-jahagirdar Is there any pending commit to this PR? I would like to start contributing for spark integration for this. |
@namratha2403 I raised #4922 for addressing #4428 (comment) separately. I'll be updating that PR later today. At that point this PR would be unblocked or if we conclude not to set the main ref to the current snapshot when parsing, we would have a default implementation for refs() here. cc: @rdblue |
@rdblue Since we now are guaranteed to have a main ref when parsing metadata
|
|
Looks like we have a consensus for this, @amogh-jahagirdar could you do a rebase to make sure spotless check passes? |
57cfac3 to
3deb581
Compare
3deb581 to
973e89a
Compare
|
Thanks for the work! Given the conversation history I think @rdblue is also good with adding these APIs, and other people have all approved. Thanks for all the reviews! |
This PR adds the ability to perform a table scan from a given branch or tag. Spark integration will be in a separate PR. Related to #3899