Spark Integration to read from Snapshot ref #5150
Merged
Commits (24 total; the diff below shows changes from 8 commits):
413e75a  Core, API: Add getting refs and snapshot by ref to the Table API (amogh-jahagirdar)
8132d20  Spark 3.2 Integration to read from Snapshot ref
c6cdf5c  checkStyle fixes
1f87b1f  Merge from master
235ff35  Spark 3.2 Integration to read from Snapshot ref
7483cf9  Spark 3.2 Integration to read from Snapshot ref
2be3e6e  Spark 3.1 Integration to read from Snapshot ref
26003f3  Adding checks to snapshot ref
2fe1e0d  Adding support to only spark 3.3 (namrathamk)
8b29f00  Adding support to only spark 3.3 (namrathamk)
b9a7803  Adding support to only spark 3.3 (namrathamk)
bb61a90  Changes from upstream/master (namrathamyske)
55a3595  Adding more tests (namrathamyske)
8c86fc8  fix tests (namrathamyske)
ac763d9  fix tests (namrathamyske)
f80684f  Merge branch 'apache:master' into spark-integration-ref (namrathamyske)
fdfb0c6  Test case fix (namrathamyske)
b5836f6  Merge branch 'spark-integration-ref' of github.com:namrathamyske/iceb… (namrathamyske)
4bad10f  Test case fix spotless (namrathamyske)
00f07a0  comments fix (namrathamyske)
f685af4  statement fix (namrathamyske)
d3fa84e  statement fix (namrathamyske)
a693c72  adding supressions (namrathamyske)
ab0832b  changing comments (namrathamyske)
The diff adds a `testSnapshotSelectionByRef` test to the snapshot selection suite:

```diff
@@ -226,4 +226,62 @@ public void testSnapshotSelectionBySnapshotIdAndTimestamp() throws IOException {
         .hasMessageContaining("Cannot specify both snapshot-id")
         .hasMessageContaining("and as-of-timestamp");
   }
+
+  @Test
+  public void testSnapshotSelectionByRef() throws IOException {
+    String tableLocation = temp.newFolder("iceberg-table").toString();
+
+    HadoopTables tables = new HadoopTables(CONF);
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Table table = tables.create(SCHEMA, spec, tableLocation);
+
+    // produce the first snapshot
+    List<SimpleRecord> firstBatchRecords = Lists.newArrayList(
+        new SimpleRecord(1, "a"),
+        new SimpleRecord(2, "b"),
+        new SimpleRecord(3, "c")
+    );
+    Dataset<Row> firstDf = spark.createDataFrame(firstBatchRecords, SimpleRecord.class);
+    firstDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
+
+    // produce the second snapshot
+    List<SimpleRecord> secondBatchRecords = Lists.newArrayList(
+        new SimpleRecord(4, "d"),
+        new SimpleRecord(5, "e"),
+        new SimpleRecord(6, "f")
+    );
+    Dataset<Row> secondDf = spark.createDataFrame(secondBatchRecords, SimpleRecord.class);
+    secondDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
+
+    Assert.assertEquals("Expected 2 snapshots", 2, Iterables.size(table.snapshots()));
+
+    // verify records in the current snapshot
+    table.manageSnapshots().createTag("firstag", table.currentSnapshot().snapshotId()).commit();
+    Dataset<Row> currentSnapshotResult = spark.read()
+        .format("iceberg")
+        .option("snapshot-ref", "firstag")
+        .load(tableLocation);
+    currentSnapshotResult.show();
+    List<SimpleRecord> currentSnapshotRecords = currentSnapshotResult.orderBy("id")
+        .as(Encoders.bean(SimpleRecord.class))
+        .collectAsList();
+    List<SimpleRecord> expectedRecords = Lists.newArrayList();
+    expectedRecords.addAll(firstBatchRecords);
+    expectedRecords.addAll(secondBatchRecords);
+    Assert.assertEquals("Current snapshot rows should match", expectedRecords, currentSnapshotRecords);
+
+    // verify records in the previous snapshot
+    Snapshot currentSnapshot = table.currentSnapshot();
+    Long parentSnapshotId = currentSnapshot.parentId();
+    table.manageSnapshots().createTag("secondtag", parentSnapshotId).commit();
+    Dataset<Row> previousSnapshotResult = spark.read()
+        .format("iceberg")
+        .option("snapshot-ref", "secondtag")
+        .load(tableLocation);
+    previousSnapshotResult.show();
+    List<SimpleRecord> previousSnapshotRecords = previousSnapshotResult.orderBy("id")
+        .as(Encoders.bean(SimpleRecord.class))
+        .collectAsList();
+    Assert.assertEquals("Previous snapshot rows should match", firstBatchRecords, previousSnapshotRecords);
+  }
 }
```
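For quick reference, the read path this test exercises is just the DataFrame reader with the `snapshot-ref` option; a minimal sketch, assuming the same `spark` session and `tableLocation` as in the test above:

```java
// Read the table as of the snapshot pinned by the "firstag" tag created above.
Dataset<Row> taggedRows = spark.read()
    .format("iceberg")
    .option("snapshot-ref", "firstag") // option name as used in this PR
    .load(tableLocation);
taggedRows.show();
```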
Review comments
I think this should be `useRef(String branchOrTagName)`. The term `SnapshotRef` is internal and I don't think it should be exposed.

I think we need to separate the `useBranch` and `useTag` APIs. As you said, refs are internal. From a Spark user perspective we also want to expose only the branch/tag terms; imo the same case could be made at the API level. Also, since branches can be combined with time travel, we could add a separate API for that, although there's an argument to be made for just combining `useBranch` + `asOfTime`. (See the sketch after this comment.)
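For contrast, the separate-API shape proposed in this comment would look roughly like the following (hypothetical signatures, not the final design):

```java
// Hypothetical: distinct scan entry points per ref type.
TableScan useBranch(String branchName); // can compose with time travel
TableScan useTag(String tagName);       // pins one snapshot; no time travel
```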

Yeah, I considered that as well. The problem is that the caller doesn't know whether the ref is a tag or a branch before calling the method. That's determined when we look at table metadata, and we don't want to force the caller to do that.

There may be a better name than "ref" for `useRef`; that seems like the problem to me. Maybe we could simplify it to `use`? I'm not sure that's obvious enough. @aokolnychyi, do you have any thoughts on the name here?

@rdblue @amogh-jahagirdar I agree that we can use a common API for tag or branch like `useRef`. We have two signatures:

- `useRef(String refName)`
- `useRef(String refName, Long timeStampMillis)` -> throws an exception for a tag, since we can't time travel to a tag.
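A rough sketch of how those two overloads could behave; this is illustrative only, not the PR's implementation: `refs()` is the lookup added by the Core/API commit on this branch, `useSnapshot` is the existing scan API, and `resolveSnapshotAsOf` is a hypothetical helper:

```java
import org.apache.iceberg.SnapshotRef;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

// Sketch: the ref is resolved from table metadata, so the caller never has to
// know up front whether the name points at a tag or a branch.
public TableScan useRef(String refName) {
  SnapshotRef ref = table.refs().get(refName);
  Preconditions.checkArgument(ref != null, "Cannot find ref: %s", refName);
  return useSnapshot(ref.snapshotId());
}

public TableScan useRef(String refName, Long timeStampMillis) {
  SnapshotRef ref = table.refs().get(refName);
  Preconditions.checkArgument(ref != null, "Cannot find ref: %s", refName);
  // a tag pins exactly one snapshot, so time travel is only valid on a branch
  Preconditions.checkArgument(ref.isBranch(), "Cannot time travel on a tag: %s", refName);
  // hypothetical helper: latest branch snapshot committed at or before the timestamp
  long snapshotId = resolveSnapshotAsOf(ref, timeStampMillis);
  return useSnapshot(snapshotId);
}
```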

Sure, this sounds reasonable. The only thing is, if we do `useRef` (or whatever better name we come up with), I think we would not want `useRef(String refName, Long timeStampMillis)`. A user would chain it with the existing `useTimestamp`, and the validation that the ref is a branch would happen in the scan context: `useRef().asOfTime()`. I don't think we would want the extra method, because time travel only applies to branches, so taking the ref there doesn't make sense to me since it's really only supported for one ref type, the branch.

If we have consensus on this, I can update https://github.com/apache/iceberg/pull/5364/files with the updated approach; then this PR could be focused on the Spark side of the integration. Will wait to hear what @aokolnychyi suggests as well!
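Under the chaining approach, caller-facing usage would look roughly like this (a sketch: `useRef` is the proposed method, while `newScan()` and `asOfTime()` already exist on the scan API):

```java
// Pin the scan to a ref, then time travel within it. If "my-branch" turns out
// to be a tag, the branch-only validation fails when the scan context is
// built, not at the useRef() call site.
TableScan scan = table.newScan()
    .useRef("my-branch")         // proposed method from this discussion
    .asOfTime(1660000000000L);   // existing time-travel API (epoch millis)
```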
In that case I would suggest:

Oh I see, the alternative is just `.useRef(refName).asOfTime(timestampMillis)`. That also works; in that case, +1 for `useRef(String refName)`.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like there is consensus for
useRef.