Spark 3.3: support version travel by reference name #6575

jackye1995 · 2023-01-12T22:13:51Z

Similar to the Trino PR I am trying to push (trinodb/trino#15646), this adds support for using a reference name in the VERSION AS OF clause.

We have a related PR, #5294 by @hililiwei, that tries to add the reference info directly in the table name. While we are consolidating that experience, I think we can first add this feature in parallel.

@amogh-jahagirdar

ajantha-bhat · 2023-01-13T07:15:23Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java

+        ValidationException.check(
+            ref != null,
+            "Cannot find matching snapshot ID or reference name for version " + version);
+        return sparkTable.copyWithSnapshotId(ref.snapshotId());


It always uses the latest commit from the reference.
I think we also need to provide a way to time travel to a snapshot within a branch/tag.

So along with the existing FOR SYSTEM_VERSION AS OF snapshotId,
we should have FOR SYSTEM_VERSION AS OF snapshotId@refName.

But whether to use '@' or some other syntax has been an open point for a long time, which @rdblue wanted to conclude.

Nessie SQL syntax for reference:
https://projectnessie.org/tools/sql/#grammar

Never mind.
After thinking a bit more about it and reviewing #6573: the snapshot log contains all the snapshots from all the branches/tags, so if we want to use any particular snapshot, we can directly use the snapshot ID without specifying branch/tag information. So there is no need for the snapshotId@refName syntax.

I think one thing that might be useful is to time travel in a branch, something like FOR SYSTEM_VERSION AS OF branchName@123456789012. But that feels very hacky; I'd rather have some syntax like the one we have been suggesting, such as SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java

jackye1995 · 2023-01-14T02:02:52Z

@aokolnychyi @RussellSpitzer @rdblue any opinions about this support?

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java

RussellSpitzer · 2023-01-14T13:01:43Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java

          "Cannot do time-travel based on both table identifier and AS OF");

-      return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+      try {


I'm not sure this can come up, but do we allow version tags to be SnapshotIds?

Like can I tag snapshot 2 to be known as 1?

Weird edge case so I don't think we really need to handle it, just thinking if this is a potential issue with the lookup code here

Currently there are no restrictions on what references can be named. For the lookup code, I think we should always be able to differentiate between a snapshot ID and a ref, since a ref will be in a quoted identifier and should always fail Long.parseLong() with a NumberFormatException. So the current implementation seems good to me.

But that's just me reading the code :), I think it's worth having a unit test just for this case to give us that confidence that it works as expected in this scenario. cc @jackye1995 let me know your thoughts

I added a test case specifically for this. Unlike Trino, Spark ignores the type of the VERSION AS OF value, so if a tag name exactly matches a snapshot ID, then the snapshot ID is always chosen.

I think this is an okay limitation, because people can work around it by adding some text like snapshot-123456890 as the tag name. But we should make it very clear in documentation.
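
For readers following the thread, the precedence described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in SparkCatalog: the class name VersionResolver, the method resolve, and the Map of reference names to head snapshot IDs are hypothetical stand-ins for the table's refs, and IllegalArgumentException stands in for the ValidationException used in the diff.

```java
import java.util.Map;

// Hypothetical sketch of the lookup order discussed in this thread.
// The refs map stands in for a table's named references
// (tag or branch name -> head snapshot ID); it is not an Iceberg API.
final class VersionResolver {

  static long resolve(String version, Map<String, Long> refs) {
    try {
      // A version that parses as a long is always treated as a snapshot ID,
      // even if a tag or branch with the same name exists.
      return Long.parseLong(version);
    } catch (NumberFormatException e) {
      // Otherwise fall back to a lookup by reference name.
      Long snapshotId = refs.get(version);
      if (snapshotId == null) {
        throw new IllegalArgumentException(
            "Cannot find matching snapshot ID or reference name for version " + version);
      }
      return snapshotId;
    }
  }

  public static void main(String[] args) {
    // A tag named "1" pointing at snapshot 2 is shadowed by snapshot ID 1.
    Map<String, Long> refs = Map.of("1", 2L, "snapshot-1", 2L);
    System.out.println(resolve("1", refs));          // prints 1
    System.out.println(resolve("snapshot-1", refs)); // prints 2
  }
}
```

In this sketch the tag named "1" can never be reached through the version string, which is why a non-numeric name such as snapshot-1 works as the workaround.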

Yeah I don't want this to be a blocker, just something to take note of.

RussellSpitzer

Looks good to me 👍

ajantha-bhat · 2023-01-16T02:42:31Z

we also need to update this SQL syntax in
https://github.com/apache/iceberg/blob/master/docs/spark-queries.md#sql

jackye1995 · 2023-01-18T18:25:43Z

we also need to update this SQL syntax in https://github.com/apache/iceberg/blob/master/docs/spark-queries.md#sql

I think @amogh-jahagirdar is working on a more detailed doc for branching and tagging, I will leave this part to him 😝

jackye1995 · 2023-01-18T23:49:50Z

Thanks for the review, everyone. I think there are enough votes and all concerns are addressed, so I will go ahead and merge this PR!

github-actions bot added the spark label Jan 12, 2023

ajantha-bhat reviewed Jan 13, 2023

ajantha-bhat mentioned this pull request Jan 13, 2023

Docs: Add information on how to read from branches and tags in Spark docs #6573

Merged

nastra approved these changes Jan 13, 2023

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java (outdated, resolved)

findinpath mentioned this pull request Jan 14, 2023

Support Iceberg version travel by reference name trinodb/trino#15646

Closed

nastra reviewed Jan 14, 2023

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java (outdated, resolved)

RussellSpitzer reviewed Jan 14, 2023

RussellSpitzer approved these changes Jan 14, 2023

Jack Ye added 5 commits January 18, 2023 11:16

Spark 3.3: support version travel by reference name

f1b5466

Use assertj library

4c919ec

Add test case and fix assertj check

f6dd28d

fix checkstyle

b1a3c56

fix test failure

1cb4a0c

amogh-jahagirdar approved these changes Jan 18, 2023

jackye1995 merged commit c05d035 into apache:master Jan 18, 2023

This was referenced Jan 19, 2023

Add Spark compatibility test for Iceberg version travel by reference name trinodb/trino#15791

Open

Delta: Support Snapshot Delta Lake Table to Iceberg Table #6449

Merged

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023

Spark 3.3: support version travel by reference name (apache#6575)

3c31900

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025

Spark 3.3: support version travel by reference name (apache#6575)

b3c685f
