Core, API: Add getting refs and snapshot by ref to the Table API #4428

amogh-jahagirdar · 2022-03-30T04:33:47Z

This PR adds the ability to perform a table scan from a given branch or tag. Spark integration will be in a separate PR. Related to #3899

amogh-jahagirdar · 2022-03-30T04:36:53Z

@rdblue @jackye1995 Let me know what you think!

amogh-jahagirdar · 2022-03-30T04:41:01Z

Investigating failing core tests

nastra · 2022-03-30T16:26:26Z

api/src/main/java/org/apache/iceberg/TableScan.java

+   * @return a new scan based on this with the given branch
+   * @throws IllegalArgumentException if the branch cannot be found
+   */
+  TableScan branch(String branch);


does the distinction between branch/tag really matter here or could we just have a TableScan ref(String ref) method?

It's a good point. My thought was that, from an API point of view, people would be more familiar with the terms branch and tag, so it would be easier to have those methods. I also think anyone using this knows beforehand whether it's a branch or a tag, so it's not cumbersome.

But yeah, currently the logic is effectively the same: use the snapshot associated with the reference.

Thoughts?

I would be in favor of just using TableScan ref(String ref) / TableScan reference(String ref), since people are usually familiar with the term reference when it comes to branches/tags, but let's see what others think.

Would useRef be better? Just to match useSnapshot.

Actually, I think there is a need to distinguish useBranch and useTag, because useBranch can be used in combination with asOfTime to perform time travel against a specific branch.
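To make the branch/tag distinction above concrete, here is a minimal sketch of what time travel against a branch could look like. It is one plausible reading of the comment, not Iceberg's implementation: `Snap`, `asOfTimeOnBranch`, and `exampleHistory` are hypothetical names, and the sketch assumes that "as of time on a branch" means the newest snapshot in the branch's ancestry committed at or before the given timestamp.

```java
import java.util.HashMap;
import java.util.Map;

final class BranchTimeTravelSketch {
  /** Hypothetical stand-in for a snapshot; not Iceberg's Snapshot interface. */
  static final class Snap {
    final long id;
    final Long parentId; // null for the first snapshot in the history
    final long timestampMillis;

    Snap(long id, Long parentId, long timestampMillis) {
      this.id = id;
      this.parentId = parentId;
      this.timestampMillis = timestampMillis;
    }
  }

  /**
   * Walks the ancestry of a branch from its head and returns the id of the newest
   * snapshot committed at or before asOfMillis, or null if there is none.
   */
  static Long asOfTimeOnBranch(Map<Long, Snap> snapshots, long branchHeadId, long asOfMillis) {
    Snap current = snapshots.get(branchHeadId);
    while (current != null) {
      if (current.timestampMillis <= asOfMillis) {
        return current.id;
      }
      current = current.parentId == null ? null : snapshots.get(current.parentId);
    }
    return null;
  }

  /** Linear history 1 <- 2 <- 3 with commit times 100, 200 and 300. */
  static Map<Long, Snap> exampleHistory() {
    Map<Long, Snap> snapshots = new HashMap<>();
    snapshots.put(1L, new Snap(1L, null, 100L));
    snapshots.put(2L, new Snap(2L, 1L, 200L));
    snapshots.put(3L, new Snap(3L, 2L, 300L));
    return snapshots;
  }
}
```

A tag, by contrast, names a single snapshot, so resolving it needs no ancestry walk; that is why it behaves like useSnapshot in this discussion.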

jackye1995 · 2022-05-25T06:10:31Z

api/src/main/java/org/apache/iceberg/Table.java

+   *
+   * @return the ref with the given name or null if it does not exist.
+   */
+  SnapshotRef ref(String ref);


Do we really need this method if it's always refs().get(ref)?

oh nvm we also have the method in TableMetadata, so probably better to have it here

Right, my thinking was just to keep it the same

Yeah, I think it's nice to provide the convenience method.
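For illustration, the convenience method discussed in this thread can be sketched as a default method that delegates to the map lookup. The types below (`TableLike`, `Ref`) are hypothetical stand-ins so the example compiles on its own; they are not the Iceberg `Table` or `SnapshotRef` types.

```java
import java.util.HashMap;
import java.util.Map;

final class RefConvenienceSketch {
  /** Hypothetical stand-in for a snapshot reference. */
  static final class Ref {
    final long snapshotId;

    Ref(long snapshotId) {
      this.snapshotId = snapshotId;
    }
  }

  /** Hypothetical table-like interface exposing refs by name. */
  interface TableLike {
    Map<String, Ref> refs();

    /** Convenience lookup: returns the ref with the given name, or null if absent. */
    default Ref ref(String name) {
      return refs().get(name);
    }
  }

  /** Builds a table-like object with a "main" branch and an "audit-tag" tag. */
  static TableLike exampleTable() {
    Map<String, Ref> refs = new HashMap<>();
    refs.put("main", new Ref(3L));
    refs.put("audit-tag", new Ref(1L));
    return () -> refs;
  }
}
```

Because the default method only forwards to `refs().get(name)`, it adds no behavior; the argument for it is purely that it mirrors the lookup already available on the metadata side.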

jackye1995 · 2022-05-25T06:17:08Z

Can we separate this a bit into 3 parts:

1.) we definitely need to add refs() related APIs to Table related places
2.) we want to support useTag, which is basically the same as useSnapshot
3.) we want to support useBranch, which can be used in combination with asOfTime

amogh-jahagirdar · 2022-05-26T21:24:21Z

Cool, I think for this PR I'll just add the refs() related APIs to Table. Will do the other PRs separately.

jackye1995 · 2022-05-27T06:16:01Z

core/src/main/java/org/apache/iceberg/BaseMetadataTable.java

  }

+  @Override
+  public SnapshotRef ref(String ref) {


feels a bit awkward for metadata table to have refs... but given the fact that we are just mapping this to methods of the source table() for all the other methods, this seems to be the only way to do it

jackye1995

looks good to me

jackye1995 · 2022-05-27T16:38:55Z

@nastra any additional comments? If not I think this is a straight-forward change and will merge later.

rdblue · 2022-05-27T17:57:58Z

api/src/main/java/org/apache/iceberg/Table.java

+   *
+   * @return the current refs for the table
+   */
+  Map<String, SnapshotRef> refs();


Do we want to expose SnapshotRef in the Table API? Why not return Snapshot for ref names?

I was thinking it made sense to expose Refs in the Table API because refs are maintained per table and any operation which leverages the table API can easily access them.

I think refs() is helpful mostly as a convenience method to list all the refs in a table (similar to being able to list the table properties, or the schemas/partition specs, etc.).

For the cases where a user of the table API knows what ref they are looking for then the ref(ref) becomes helpful. If we don't return Snapshot for a given ref name, then a caller has to do

table.snapshot(table.refs().get(name).snapshotId())

and they have to also take care of the null check in case a ref with name does not exist.

So my conclusion is the following:

1.) Exposing SnapshotRef in the Table API should be fine, because refs are maintained at the table level, and an API that exposes them either as a collection or through a convenience method to look them up by name fits the model and could be used.

2.) It will be common to want the snapshot for a given ref, and it also makes sense to have an API for returning a Snapshot for a given ref name.

Alternative:

The tradeoff of the above is that the API is less minimal. For the purpose of the table scan we will ultimately need the snapshot for a given ref. So if we want to start minimal, we can add the Snapshot snapshot(String ref) signature now and add refs() and ref(String name) only when they are truly needed.

@rdblue @jackye1995 @singhpk234 Let me know your thoughts!

For now, I'd probably add the snapshot(String name) method and put off adding methods to Table until we know that we will definitely need them.

Actually, one thing I forgot was for SerializableTable, we'll need to pass the refs through https://github.com/apache/iceberg/pull/4428/files#diff-46dafb425240806166ccc8f27a6c301781c5082bf1a187291cc92c2a8f17a588R85 so we'll need Map<String, SnapshotRef> refs() to implement the contract of snapshot(String name) in SerializableTable.

So then we need both refs() and snapshot(String name).

I think we can leave off ref(String name) for now, until we know that the convenience method is really useful. I anticipate it will become useful mostly to avoid another level of indirection when looking up from the map, but we can aim to keep the API changes minimal for now.
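The relationship between refs() and snapshot(String name) described above can be sketched as follows. This is a self-contained illustration with hypothetical stand-in types (`Ref`, `Snap`, `SnapshotByRefSketch`), not the code from this PR; it only shows how a by-name snapshot lookup can be built from the refs map and handle the missing-ref case in one place.

```java
import java.util.HashMap;
import java.util.Map;

final class SnapshotByRefSketch {
  /** Hypothetical stand-in for a snapshot reference. */
  static final class Ref {
    final long snapshotId;

    Ref(long snapshotId) {
      this.snapshotId = snapshotId;
    }
  }

  /** Hypothetical stand-in for a snapshot. */
  static final class Snap {
    final long id;

    Snap(long id) {
      this.id = id;
    }
  }

  private final Map<String, Ref> refs = new HashMap<>();
  private final Map<Long, Snap> snapshots = new HashMap<>();

  Map<String, Ref> refs() {
    return refs;
  }

  Snap snapshot(long snapshotId) {
    return snapshots.get(snapshotId);
  }

  /** Resolves a ref name to its snapshot; returns null when the ref does not exist. */
  Snap snapshot(String name) {
    Ref ref = refs().get(name);
    return ref == null ? null : snapshot(ref.snapshotId);
  }

  /** Table-like object with one snapshot (id 3) referenced by "main". */
  static SnapshotByRefSketch example() {
    SnapshotByRefSketch table = new SnapshotByRefSketch();
    table.snapshots.put(3L, new Snap(3L));
    table.refs.put("main", new Ref(3L));
    return table;
  }
}
```

With this shape, callers no longer need to chain the map lookup, the snapshot ID, and their own null check.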

nastra

LGTM

rdblue · 2022-05-30T15:01:41Z

api/src/main/java/org/apache/iceberg/Table.java

+   *
+   * @return the current refs for the table
+   */
+  Map<String, SnapshotRef> refs();


Shouldn't this also have a default that creates a map of "main" to the current snapshot ID?

That's one way to go. I've been thinking it would actually be simpler if, when parsing the metadata, we set the main branch to the current snapshot whenever a main branch does not exist. That way a lot of the API logic can rely on the assumption that main exists, and we don't need a lot of special code for the case where the new Iceberg library is reading an older metadata file where refs (and thus main) may not exist.

This PR would be blocked on a PR for doing that so I will raise that.

Yeah, that sounds reasonable as well.
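The parse-time defaulting idea above can be sketched like this. It is a simplified illustration, not the parser change itself: refs are modeled as a plain map from name to snapshot ID, and `withMainDefault` and the use of a null `currentSnapshotId` to mean "no current snapshot" are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class MainRefDefaultSketch {
  static final String MAIN_BRANCH = "main";

  /**
   * Returns a copy of the parsed refs in which "main" points at the current snapshot
   * whenever the metadata has a current snapshot but no main branch. An existing
   * main entry is left untouched.
   */
  static Map<String, Long> withMainDefault(Map<String, Long> parsedRefs, Long currentSnapshotId) {
    Map<String, Long> refs = new HashMap<>();
    if (parsedRefs != null) {
      refs.putAll(parsedRefs); // older metadata files may carry no refs at all
    }
    if (currentSnapshotId != null) {
      refs.putIfAbsent(MAIN_BRANCH, currentSnapshotId);
    }
    return refs;
  }
}
```

Doing this once at parse time means downstream code can look up main without a separate fallback path for older metadata.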

namrathamyske · 2022-06-01T05:08:44Z

@amogh-jahagirdar Is there any pending commit to this PR? I would like to start contributing for spark integration for this.

amogh-jahagirdar · 2022-06-01T16:51:24Z

@namratha2403 I raised #4922 for addressing #4428 (comment) separately. I'll be updating that PR later today. At that point this PR would be unblocked; or, if we conclude not to set the main ref to the current snapshot when parsing, we would have a default implementation for refs() here.

cc: @rdblue

amogh-jahagirdar · 2022-07-01T03:30:29Z

@rdblue Since we now are guaranteed to have a main ref when parsing metadata

iceberg/core/src/main/java/org/apache/iceberg/TableMetadataParser.java

Line 426 in dec5679

    
           refs = ImmutableMap.of(SnapshotRef.MAIN_BRANCH, SnapshotRef.branchBuilder(currentSnapshotId).build());

I think the code in this change can operate with that assumption. Let me know what you think!

jackye1995 · 2022-08-04T22:30:24Z

Looks like we have a consensus for this, @amogh-jahagirdar could you do a rebase to make sure spotless check passes?

jackye1995 · 2022-08-05T16:35:09Z

Thanks for the work! Given the conversation history I think @rdblue is also good with adding these APIs, and other people have all approved. Thanks for all the reviews!

github-actions bot added API core labels Mar 30, 2022

amogh-jahagirdar marked this pull request as ready for review March 30, 2022 04:35

amogh-jahagirdar force-pushed the scan-from-ref branch from c75d13c to 0f37b2a Compare March 30, 2022 04:56

nastra reviewed Mar 30, 2022

jackye1995 reviewed May 25, 2022

amogh-jahagirdar force-pushed the scan-from-ref branch from 0f37b2a to cb24b7a Compare May 26, 2022 22:20

amogh-jahagirdar changed the title ~~Core, API: Support table scan from a ref~~ Core, API: Add refs/ref methods to Table and related subclasses May 26, 2022

amogh-jahagirdar changed the title ~~Core, API: Add refs/ref methods to Table and related subclasses~~ Core, API: Add refs/ref methods to Table May 26, 2022

amogh-jahagirdar requested a review from jackye1995 May 26, 2022 23:44

jackye1995 reviewed May 27, 2022

jackye1995 approved these changes May 27, 2022

xiaoxuandev approved these changes May 27, 2022

singhpk234 approved these changes May 27, 2022

rdblue reviewed May 27, 2022

amogh-jahagirdar force-pushed the scan-from-ref branch 2 times, most recently from 1a16032 to 413e75a Compare May 30, 2022 04:40

amogh-jahagirdar commented May 30, 2022

amogh-jahagirdar changed the title ~~Core, API: Add refs/ref methods to Table~~ Core, API: Add getting refs and snapshot by ref to the Table API May 30, 2022

nastra approved these changes May 30, 2022

rdblue reviewed May 30, 2022

hililiwei mentioned this pull request Jun 14, 2022

Flink: Use Tag or Branch to scan data. #5029

Merged

namrathamyske mentioned this pull request Jun 28, 2022

Spark Integration to read from Snapshot ref #5150

Merged

amogh-jahagirdar requested a review from rdblue July 1, 2022 04:33

hililiwei approved these changes Jul 4, 2022

hililiwei mentioned this pull request Jul 18, 2022

Spark: Spark SQL read from Snapshot ref #5294

Closed

amogh-jahagirdar force-pushed the scan-from-ref branch 5 times, most recently from 57cfac3 to 3deb581 Compare August 4, 2022 23:09

Core, API: Add getting refs and snapshot by ref to the Table API

973e89a

amogh-jahagirdar force-pushed the scan-from-ref branch from 3deb581 to 973e89a Compare August 4, 2022 23:09

jackye1995 approved these changes Aug 5, 2022

jackye1995 merged commit dafb480 into apache:master Aug 5, 2022

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025

[Cherry-Pick] Core, API: Add getting refs and snapshot by ref to the …

317b705

…Table API (apache#4428)
