Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Mar 30, 2022

This PR adds the ability to perform a table scan from a given branch or tag. Spark integration will be in a separate PR. Related to #3899

@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review March 30, 2022 04:35
@amogh-jahagirdar
Copy link
Contributor Author

@rdblue @jackye1995 Let me know what you think!

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Mar 30, 2022

Investigating failing core tests

* @return a new scan based on this with the given branch
* @throws IllegalArgumentException if the branch cannot be found
*/
TableScan branch(String branch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the distinction between branch/tag really matter here or could we just have a TableScan ref(String ref) method?

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. My thought was that from an API point of view people would be more familiar with the terms branch and tag, so it would be easier to have those methods. I also think if is using this, they know before hand if it's a branch or a tag, so it's not cumbersome.

But yeah currently, the logic is effectively the same which is to use the snapshot associated with the reference.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor of just using TableScan ref(String ref) / TableScan reference(String ref) since usually people are familiar with the term reference when it comes to branches/tag, but let's see what others think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would useRef be better? Just to match useSnapshot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think there is a need to distinguish useBranch and useTag, because useBranch can be used in combiantion with asOfTime to perform time travel against a specific branch.

*
* @return the ref with the given name or null if it does not exist.
*/
SnapshotRef ref(String ref);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this method, if it's always refs().get(ref)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nvm we also have the method in TableMetadata, so probably better to have it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my thinking was just to keep it the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's nice to provide the convenience method.

@jackye1995
Copy link
Contributor

Can we separate this a bit to 3 parts:

  1. we need to definitely add refs() related APIs to Table related places
  2. we want to support useTag that is basically the same as useSnapshot
  3. we want to support useBranch that can be used in combination with asOfTime

@amogh-jahagirdar
Copy link
Contributor Author

Cool, I think for this PR I'll just add the refs() related APIS to Table. Will do the other PRs separately.

@amogh-jahagirdar amogh-jahagirdar changed the title Core, API: Support table scan from a ref Core, API: Add refs/ref methods to Table and related subclasses May 26, 2022
@amogh-jahagirdar amogh-jahagirdar changed the title Core, API: Add refs/ref methods to Table and related subclasses Core, API: Add refs/ref methods to Table May 26, 2022
}

@Override
public SnapshotRef ref(String ref) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels a bit awkward for metadata table to have refs... but given the fact that we are just mapping this to methods of the source table() for all the other methods, this seems to be the only way to do it

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@jackye1995
Copy link
Contributor

@nastra any additional comments? If not I think this is a straight-forward change and will merge later.

*
* @return the current refs for the table
*/
Map<String, SnapshotRef> refs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose SnapshotRef in the Table API? Why not return Snapshot for ref names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it made sense to expose Refs in the Table API because refs are maintained per table and any operation which leverages the table API can easily access them.

I think refs() is helpful mostly as a convenience method to list all the refs in a table (similar being able to list the table properties, or the schemas/partition specs etc).

For the cases where a user of the table API knows what ref they are looking for then the ref(ref) becomes helpful. If we don't return Snapshot for a given ref name, then a caller has to do

table.snapshot(table.refs(name).snashotId())

and they have to also take care of the null check in case a ref with name does not exist.

So my conclusion is the following:

1.) Expose the SnapshotRef in the Table API should be fine, because refs are maintained at the table level and an API expose to them either as a collection or a convenience method to look them up by name fits the model, and could be used.

2.) It will be common to want the snapshot for a given ref, and it also makes sense to have an API for returning a Snapshot for a given ref name.

Alternative:

The tradeoff of the above is the API is less minimal. For the purpose of the table scan we will ultimately need the snapshot for a given ref. So if we want to start minimal, what we can do is add the

Snapshot snapshot(String ref) signature and only when the refs(), ref(String name) are truly needed we add those.

@rdblue @jackye1995 @singhpk234 Let me know your thoughts!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'd probably add the snapshot(String name) method and put off adding methods to Table until we know that we will definitely need them.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one thing I forgot was for SerializableTable, we'll need to pass the refs through https://github.com/apache/iceberg/pull/4428/files#diff-46dafb425240806166ccc8f27a6c301781c5082bf1a187291cc92c2a8f17a588R85 so we'll need Map<String, SnapshotRef> refs() to implement the contract of snapshot(String name) in SerializableTable.

So then we need both refs() and snapshot(String name).

I think can leave off the ref(String name) for now, until we know that convenience method is really useful; I anticipate it will become useful mostly just to avoid another level of indirection when looking up from the map, but we can aim for just keeping the API changes minimal for now.

@amogh-jahagirdar amogh-jahagirdar force-pushed the scan-from-ref branch 2 times, most recently from 1a16032 to 413e75a Compare May 30, 2022 04:40
@amogh-jahagirdar amogh-jahagirdar changed the title Core, API: Add refs/ref methods to Table Core, API: Add getting refs and snapshot by ref to the Table API May 30, 2022
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

*
* @return the current refs for the table
*/
Map<String, SnapshotRef> refs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also have a default that creates a map of "main" to the current snapshot ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one way to go, I've been thinking it would actually be simpler if when parsing the metadata, if a main branch does not exist, set it to the current snapshot. That way a lot of the API logic can rely on this assumption that main exists and we don't need a lot of special code for the case where the new Iceberg library is reading an older metadata file where refs (and thus main) may not exist.

This PR would be blocked on a PR for doing that so I will raise that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds reasonable as well.

@namrathamyske
Copy link
Contributor

@amogh-jahagirdar Is there any pending commit to this PR? I would like to start contributing for spark integration for this.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jun 1, 2022

@amogh-jahagirdar Is there any pending commit to this PR? I would like to start contributing for spark integration for this.

@namratha2403 I raised #4922 for addressing #4428 (comment) separately. I'll be updating that PR later today. At that point this PR would be unblocked or if we conclude not to set the main ref to the current snapshot when parsing, we would have a default implementation for refs() here.

cc: @rdblue

@amogh-jahagirdar
Copy link
Contributor Author

@amogh-jahagirdar Is there any pending commit to this PR? I would like to start contributing for spark integration for this.

@namratha2403 I raised #4922 for addressing #4428 (comment) separately. I'll be updating that PR later today. At that point this PR would be unblocked or if we conclude not to set the main ref to the current snapshot when parsing, we would have a default implementation for refs() here.

cc: @rdblue

@rdblue Since we now are guaranteed to have a main ref when parsing metadata

refs = ImmutableMap.of(SnapshotRef.MAIN_BRANCH, SnapshotRef.branchBuilder(currentSnapshotId).build());
I think the code in this change can operate with that assumption. Let me know what you think!

@amogh-jahagirdar amogh-jahagirdar requested a review from rdblue July 1, 2022 04:33
@jackye1995
Copy link
Contributor

Looks like we have a consensus for this, @amogh-jahagirdar could you do a rebase to make sure spotless check passes?

@amogh-jahagirdar amogh-jahagirdar force-pushed the scan-from-ref branch 5 times, most recently from 57cfac3 to 3deb581 Compare August 4, 2022 23:09
@jackye1995
Copy link
Contributor

Thanks for the work! Given the conversation history I think @rdblue is also good with adding these APIs, and other people have all approved. Thanks for all the reviews!

@jackye1995 jackye1995 merged commit dafb480 into apache:master Aug 5, 2022
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants