-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark 3.3: support version travel by reference name #6575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| ValidationException.check( | ||
| ref != null, | ||
| "Cannot find matching snapshot ID or reference name for version " + version); | ||
| return sparkTable.copyWithSnapshotId(ref.snapshotId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It always use the latest commit from the reference.
I think we also need to provide a way to time travel to a snapshot within a branch/tag?
So along with existing of FOR SYSTEM_VERSION AS OF snapshotId
we should have FOR SYSTEM_VERSION AS OF snapshotId@refName
But whether to use '@' or some other syntax is an open point for a long time which @rdblue wanted to conclude.
Nessie SQL syntax for reference:
https://projectnessie.org/tools/sql/#grammar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind.
After thinking a bit more about it and reviewing #6573, As the snapshot log contains all the snapshots from all the branches/tags. If we want to use any particular snapshot, we can directly use snapshot-id without specifying branch/tag information. So, no need of snapshotId@refName syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one thing that might be useful is to time travel in a branch, something like FOR SYSTEM_VERSION AS OF branchName@123456789012. But that feels very hacky, I' rather have some syntax as we have been suggesting like SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java
Outdated
Show resolved
Hide resolved
|
@aokolnychyi @RussellSpitzer @rdblue any opinions about this support? |
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java
Outdated
Show resolved
Hide resolved
| "Cannot do time-travel based on both table identifier and AS OF"); | ||
|
|
||
| return sparkTable.copyWithSnapshotId(Long.parseLong(version)); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this can come up, but do we allow version tags to be SnapshotIds?
Like can I tag snapshot 2 to be known as 1?
Weird edge case so I don't think we really need to handle it, just thinking if this is a potential issue with the lookup code here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently there's no restrictions on what references can be named. For the lookup code, I think we should always be able to differentiate between snapshot ID and ref since for refs it will be in a quoted identifier, and should always fail the Long.parseLong() with a NumberParseException. So the current implementation seems good to me.
But that's just me reading the code :), I think it's worth having a unit test just for this case to give us that confidence that it works as expected in this scenario. cc @jackye1995 let me know your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test case specifically for this. Unlike Trino, Spark directly ignores the type of the VERSION AS OF, so if a tag name matches exactly the snapshot ID, then snapshot ID is always chosen.
I think this is a okay limitation, because people can work around it by adding some text like snapshot-123456890 as the tag name. But we should make it very clear in documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't want this to be a blocker, just something to take note of.
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
|
we also need to update this SQL syntax in |
I think @amogh-jahagirdar is working on a more detailed doc for branching and tagging, I will leave this part to him 😝 |
|
Thanks for the review everyone, I think there are enough votes and all concerns are addressed, I will go ahead merging this PR! |
Similar to the Trino PR I am trying to push trinodb/trino#15646, support using reference name for the
VERSION AS OFclause.We have a related PR #5294 by @hililiwei that tries to directly add the reference info in table name, while we are consolidating that experience, I think we can first have this feature added in parallel.
@amogh-jahagirdar