@hililiwei
Contributor

@hililiwei hililiwei commented Jun 14, 2022

What is the purpose of the change

Scan data using a specified tag or branch.

Brief change log

Add the following syntax
SQL:

* SELECT * FROM sample /*+ OPTIONS('tag'='t1')*/ ;
* SELECT * FROM sample /*+ OPTIONS('branch'='t1')*/ ;

API:

    public Builder tag(String tag) {
      contextBuilder.useTag(tag);
      return this;
    }

    public Builder branch(String branch) {
      contextBuilder.useBranch(branch);
      return this;
    }

    public Builder startTag(String startTag) {
      contextBuilder.startTag(startTag);
      return this;
    }

    public Builder endTag(String endTag) {
      contextBuilder.endTag(endTag);
      return this;
    }
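
A minimal, self-contained sketch of how a fluent builder with these four methods might be used. `RefScanOptionsSketch` and its accessors are hypothetical stand-ins written for illustration; they are not the Flink source builder or the scan context touched by this PR.

```java
import java.util.Optional;

// Hypothetical stand-in for the scan options; not the actual Iceberg/Flink classes.
final class RefScanOptionsSketch {
  private final String tag;
  private final String branch;
  private final String startTag;
  private final String endTag;

  private RefScanOptionsSketch(Builder builder) {
    this.tag = builder.tag;
    this.branch = builder.branch;
    this.startTag = builder.startTag;
    this.endTag = builder.endTag;
  }

  // Unset options are exposed as empty Optionals instead of nulls.
  Optional<String> tag() { return Optional.ofNullable(tag); }
  Optional<String> branch() { return Optional.ofNullable(branch); }
  Optional<String> startTag() { return Optional.ofNullable(startTag); }
  Optional<String> endTag() { return Optional.ofNullable(endTag); }

  static Builder builder() { return new Builder(); }

  static final class Builder {
    private String tag;
    private String branch;
    private String startTag;
    private String endTag;

    // Each setter returns the builder so calls can be chained.
    Builder tag(String tag) { this.tag = tag; return this; }
    Builder branch(String branch) { this.branch = branch; return this; }
    Builder startTag(String startTag) { this.startTag = startTag; return this; }
    Builder endTag(String endTag) { this.endTag = endTag; return this; }

    RefScanOptionsSketch build() { return new RefScanOptionsSketch(this); }
  }

  public static void main(String[] args) {
    // Batch read pinned to a tag.
    RefScanOptionsSketch batch = builder().tag("t1").build();
    // Incremental read between two tags.
    RefScanOptionsSketch incremental = builder().startTag("t1").endTag("t2").build();

    System.out.println("tag=" + batch.tag().orElse("<unset>"));
    System.out.println("range=" + incremental.startTag().orElse("<unset>")
        + ".." + incremental.endTag().orElse("<unset>"));
  }
}
```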

cc @amogh-jahagirdar

@github-actions github-actions bot added the flink label Jun 14, 2022
@amogh-jahagirdar
Contributor

This depends on https://github.com/apache/iceberg/pull/4428/files. I will follow up on that PR.

@hililiwei
Contributor Author

As the core is being modified, this PR needs to wait for #5364 and #5475 to complete.

@hililiwei hililiwei closed this Oct 10, 2022
@hililiwei hililiwei reopened this Oct 10, 2022
@hililiwei hililiwei force-pushed the ref-flink branch 3 times, most recently from c47b6f8 to 2db6835 Compare October 13, 2022 14:38
@hililiwei
Contributor Author

cc @amogh-jahagirdar @stevenzwu

@hililiwei
Contributor Author

cc @amogh-jahagirdar @stevenzwu @rdblue, could you please take a look when you are available?

@hililiwei hililiwei force-pushed the ref-flink branch 2 times, most recently from be61f72 to f257c2e Compare January 30, 2023 11:46
@jackye1995 jackye1995 self-requested a review January 30, 2023 22:30
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks for this, @hililiwei! I left some comments/questions. It makes sense to just carry this PR forward based on our discussion in #6660.

scanContext.branch() == null
? table.currentSnapshot()
: table.snapshot(scanContext.branch());
Preconditions.checkNotNull(
Contributor

I think it'll help if at some point (probably options/conf default) we set the default branch to main and always assume a branch is set in this code. Then we can just do table.snapshot(branch).

Nit on log message:

"No snapshots on branch %s in table %s"

Contributor

+1, can we actually do that in this PR? (Can we check what is done in Spark: did we decide to go with the main branch or with an if/else? I cannot remember now.)

Contributor Author

Spark doesn't seem to use main as the default.

My main concern is that a main branch is not compatible with tables from older Iceberg versions.

Contributor

When we read the table metadata with the current Iceberg version, we set the main branch (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L444), so it will be compatible with tables from older versions.
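
The default-branch idea discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the table's refs are modeled as a plain map from ref name to snapshot id, and `resolveSnapshotId` is a hypothetical helper. Only the branch name `main` and the suggested log message come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

final class BranchResolutionSketch {
  // Name of the default branch, as discussed in the review.
  static final String MAIN_BRANCH = "main";

  // Hypothetical helper: refs maps a ref name to its snapshot id.
  static long resolveSnapshotId(Map<String, Long> refs, String tableName, String configuredBranch) {
    // Default once, up front, so the rest of the code can assume a branch is always set.
    String branch = configuredBranch == null ? MAIN_BRANCH : configuredBranch;
    Long snapshotId = refs.get(branch);
    if (snapshotId == null) {
      throw new IllegalStateException(
          String.format("No snapshots on branch %s in table %s", branch, tableName));
    }
    return snapshotId;
  }

  public static void main(String[] args) {
    Map<String, Long> refs = new HashMap<>();
    refs.put("main", 10L);
    refs.put("audit", 7L);

    System.out.println(resolveSnapshotId(refs, "sample", null));    // falls back to main
    System.out.println(resolveSnapshotId(refs, "sample", "audit")); // explicit branch
  }
}
```

With the default applied in one place, the null check on the branch disappears from the call sites, which is the simplification the reviewer asked for.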

Comment on lines 138 to 153
long currentSnapshotId = currentSnapshot.snapshotId();
long startSnapshotId = table.snapshot(scanContext.startTag()).snapshotId();
Preconditions.checkState(
SnapshotUtil.isAncestorOf(table, currentSnapshotId, startSnapshotId),
"The option start-snapshot-id %s is not an ancestor of the current snapshot.",
startSnapshotId);

lastSnapshotId = startSnapshotId;
} else if (scanContext.startSnapshotId() != null) {
Snapshot currentSnapshot =
scanContext.branch() == null
? table.currentSnapshot()
: table.snapshot(scanContext.branch());
Preconditions.checkNotNull(
- table.currentSnapshot(), "Don't have any available snapshot in table.");
+ currentSnapshot,
+ "Don't have any available snapshot for branch " + scanContext.branch() + " in table.");
Contributor

Also sorry if I missed it, but at some point shouldn't there be validation that the start tag is an ancestor of the end tag?

Contributor Author

They are eventually converted to snapshot IDs and checked during the planTasks phase.
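
A rough sketch of the check being asked about: resolve both tags to snapshot ids, then walk the parent chain from the end snapshot looking for the start snapshot. Everything here is a simplified assumption for illustration. Snapshot lineage is modeled as a map from snapshot id to parent id, `validateTagRange` is a hypothetical helper, and the sketch treats a snapshot as its own ancestor. It does not reproduce what planTasks does.

```java
import java.util.HashMap;
import java.util.Map;

final class TagAncestrySketch {
  // Walks from snapshotId towards the root; parents maps snapshot id -> parent id.
  static boolean isAncestorOf(Map<Long, Long> parents, long snapshotId, long ancestorId) {
    Long current = snapshotId;
    while (current != null) {
      if (current == ancestorId) {
        return true;
      }
      current = parents.get(current); // null once the root snapshot is passed
    }
    return false;
  }

  // Hypothetical helper: tags maps a tag name to its snapshot id.
  static void validateTagRange(
      Map<String, Long> tags, Map<Long, Long> parents, String startTag, String endTag) {
    Long start = tags.get(startTag);
    Long end = tags.get(endTag);
    if (start == null || end == null) {
      throw new IllegalArgumentException("Unknown tag: " + (start == null ? startTag : endTag));
    }
    if (!isAncestorOf(parents, end, start)) {
      throw new IllegalStateException(
          String.format("Start tag %s is not an ancestor of end tag %s", startTag, endTag));
    }
  }

  public static void main(String[] args) {
    // Lineage: 1 <- 2 <- 3, with 4 on a side branch off 1.
    Map<Long, Long> parents = new HashMap<>();
    parents.put(2L, 1L);
    parents.put(3L, 2L);
    parents.put(4L, 1L);

    Map<String, Long> tags = new HashMap<>();
    tags.put("t1", 1L);
    tags.put("t2", 3L);
    tags.put("side", 4L);

    validateTagRange(tags, parents, "t1", "t2"); // passes
    System.out.println(isAncestorOf(parents, 3L, 4L)); // side snapshot is not in t2's lineage
  }
}
```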

@jackye1995 jackye1995 requested a review from stevenzwu February 3, 2023 01:24
@jackye1995
Contributor

@stevenzwu since you are reviewing #6660, could you also take a look at this?

@hililiwei hililiwei force-pushed the ref-flink branch 5 times, most recently from 06aa678 to 44f0e5e Compare February 7, 2023 07:06
Preconditions.checkArgument(
branch == null,
String.format(
"Cannot scan table using ref %s configured for streaming reader yet", branch));
Contributor Author

@hililiwei hililiwei Feb 7, 2023

#5984 seems to be a prerequisite for Flink to implement streaming incremental reads with a branch.

} else if (scanContext.startTag() != null || scanContext.startSnapshotId() != null) {
Preconditions.checkArgument(
!(scanContext.startTag() != null && scanContext.startSnapshotId() != null),
"START_SNAPSHOT_ID and START_TAG cannot be used both.");
Contributor

nit: cannot both be set?
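
For illustration, the two argument checks quoted in this review (mutually exclusive start options, and no branch for the streaming reader yet) could be sketched in isolation like this. `checkArgument` is a small local stand-in for a Preconditions-style helper, and `validate` with its parameter list is hypothetical. The first message uses the wording suggested in the nit above.

```java
final class StartOptionChecksSketch {
  // Local stand-in for a Preconditions-style check; throws on a false condition.
  static void checkArgument(boolean condition, String message, Object... args) {
    if (!condition) {
      throw new IllegalArgumentException(String.format(message, args));
    }
  }

  // Hypothetical validation entry point; the real checks live in different classes.
  static void validate(String startTag, Long startSnapshotId, String branch, boolean streaming) {
    // At most one of the two start options may be supplied.
    checkArgument(
        startTag == null || startSnapshotId == null,
        "START_SNAPSHOT_ID and START_TAG cannot both be set.");
    // Streaming reads from a branch are rejected until they are supported.
    checkArgument(
        !streaming || branch == null,
        "Cannot scan table using ref %s configured for streaming reader yet",
        branch);
  }

  public static void main(String[] args) {
    validate("t1", null, null, false); // passes
    try {
      validate("t1", 5L, null, false);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```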

Contributor

@jackye1995 jackye1995 left a comment

Mostly looks good to me, just 1 nit comment

@hililiwei
Contributor Author

@stevenzwu addressed the round of comments, PTAL, thx.


  public void appendToTable(List<Record> records) throws IOException {
-   appendToTable(null, records);
+   appendToTable(null, null, records);
Contributor

nit: this can be appendToTable(null, records), right?

Contributor Author

Currently it won't work, because a null first argument is ambiguous between two methods:
appendToTable(String branch, List<Record> records)
and
appendToTable(StructLike partition, List<Record> records)

It would only work if I moved branch after records, but I prefer to leave it where it is currently, at the front.
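
The overload problem described here can be reproduced in a few lines. This is a standalone sketch: `Partition` is a hypothetical stand-in for StructLike, records are plain strings, and the three-argument overload is assumed from the `appendToTable(null, null, records)` call in the diff. A bare `null` matches both two-argument overloads, so the compiler rejects the call, while a cast or the three-argument form is unambiguous.

```java
import java.util.ArrayList;
import java.util.List;

final class OverloadAmbiguitySketch {
  // Hypothetical stand-in for StructLike.
  interface Partition {}

  static String appendToTable(String branch, List<String> records) {
    return "branch overload";
  }

  static String appendToTable(Partition partition, List<String> records) {
    return "partition overload";
  }

  // Assumed three-argument form, inferred from the call shown in the diff.
  static String appendToTable(String branch, Partition partition, List<String> records) {
    return "branch+partition overload";
  }

  public static void main(String[] args) {
    List<String> records = new ArrayList<>();

    // appendToTable(null, records);
    // The line above does not compile: null fits both String and Partition,
    // and neither type is more specific than the other.

    // A cast picks one overload explicitly.
    System.out.println(appendToTable((String) null, records));

    // The three-argument form has only one candidate, so plain nulls are fine.
    System.out.println(appendToTable(null, null, records));
  }
}
```

Casting the null is the alternative to reordering the parameters, at the cost of a noisier call site.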

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Looks good to me, thanks @hililiwei for the contribution!

@jackye1995
Contributor

Looks like all comments are addressed and we have enough votes. Thanks for the work, @hililiwei, and thanks for the review, @stevenzwu and @amogh-jahagirdar!

@jackye1995 jackye1995 merged commit 6cd3d24 into apache:master Feb 14, 2023
@hililiwei hililiwei deleted the ref-flink branch February 15, 2023 06:25
@hililiwei
Contributor Author

Thanks for the review @jackye1995 @stevenzwu @amogh-jahagirdar
