Skip to content

Conversation

@jackye1995
Copy link
Contributor

Similar to the Trino PR I am trying to push trinodb/trino#15646, support using reference name for the VERSION AS OF clause.

We have a related PR #5294 by @hililiwei that tries to directly add the reference info in table name, while we are consolidating that experience, I think we can first have this feature added in parallel.

@amogh-jahagirdar

@github-actions github-actions bot added the spark label Jan 12, 2023
ValidationException.check(
ref != null,
"Cannot find matching snapshot ID or reference name for version " + version);
return sparkTable.copyWithSnapshotId(ref.snapshotId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always use the latest commit from the reference.
I think we also need to provide a way to time travel to a snapshot within a branch/tag?

So along with existing of FOR SYSTEM_VERSION AS OF snapshotId
we should have FOR SYSTEM_VERSION AS OF snapshotId@refName

But whether to use '@' or some other syntax is an open point for a long time which @rdblue wanted to conclude.

Nessie SQL syntax for reference:
https://projectnessie.org/tools/sql/#grammar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind.
After thinking a bit more about it and reviewing #6573, As the snapshot log contains all the snapshots from all the branches/tags. If we want to use any particular snapshot, we can directly use snapshot-id without specifying branch/tag information. So, no need of snapshotId@refName syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one thing that might be useful is to time travel in a branch, something like FOR SYSTEM_VERSION AS OF branchName@123456789012. But that feels very hacky, I' rather have some syntax as we have been suggesting like SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.

@jackye1995
Copy link
Contributor Author

@aokolnychyi @RussellSpitzer @rdblue any opinions about this support?

"Cannot do time-travel based on both table identifier and AS OF");

return sparkTable.copyWithSnapshotId(Long.parseLong(version));
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this can come up, but do we allow version tags to be SnapshotIds?

Like can I tag snapshot 2 to be known as 1?

Weird edge case so I don't think we really need to handle it, just thinking if this is a potential issue with the lookup code here

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there's no restrictions on what references can be named. For the lookup code, I think we should always be able to differentiate between snapshot ID and ref since for refs it will be in a quoted identifier, and should always fail the Long.parseLong() with a NumberParseException. So the current implementation seems good to me.

But that's just me reading the code :), I think it's worth having a unit test just for this case to give us that confidence that it works as expected in this scenario. cc @jackye1995 let me know your thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test case specifically for this. Unlike Trino, Spark directly ignores the type of the VERSION AS OF, so if a tag name matches exactly the snapshot ID, then snapshot ID is always chosen.

I think this is a okay limitation, because people can work around it by adding some text like snapshot-123456890 as the tag name. But we should make it very clear in documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't want this to be a blocker, just something to take note of.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@ajantha-bhat
Copy link
Member

@jackye1995
Copy link
Contributor Author

we also need to update this SQL syntax in https://github.com/apache/iceberg/blob/master/docs/spark-queries.md#sql

I think @amogh-jahagirdar is working on a more detailed doc for branching and tagging, I will leave this part to him 😝

@jackye1995
Copy link
Contributor Author

Thanks for the review everyone, I think there are enough votes and all concerns are addressed, I will go ahead merging this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants