Skip to content

Conversation

@namrathamyske
Copy link
Contributor

@namrathamyske namrathamyske commented Jun 1, 2022

This PR is related to the open issue #3896 to perform table operations on a branch ref. Right now throws UnsupportedOperationException. In laters PR's will include implementation for each operation.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for contributing this!

@namrathamyske
Copy link
Contributor Author

namrathamyske commented Jun 1, 2022

@amogh-jahagirdar Updated all minor changes. Have a question regarding writing the unit test for this. I would need a way to test the refs from table. How can i proceed with that?

@namrathamyske namrathamyske changed the title Append changes to snapshot branch Core, API: Append changes to snapshot branch Jun 1, 2022
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a more fundamental issue here for maintaining the correct snapshot lineage on a branch. The parent of the new snapshot needs to be set to the tip of the branch so that when the operation commits the lineage is correct.

Also there should be validation that the ref is actually a branch and not a tag since we cannot produce snapshots to a tag.

In apply https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L170

If the branch is main, the parent is the current snapshot

If the branch is not main, the parent is the tip of the branch which would be retrieved via table.ref(branch).snapshotId().

ThisT scanManifestsWith(ExecutorService executorService);

/**
* Perform operations a particular branch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a @param to document branch. I think we also need to specify behavior:

  • What happens if the branch does not exist? I think we would create one from the main branch.
  • How does this change validation? What if the validation's from snapshot is not an ancestor of the branch?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@namrathamyske Each SnapshotProducer can implement their own validation logic https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L156

This gets called when trying to apply the operation as part of validating conflicts. For example rewriting data files needs to validate that there are no new row level deletes for data files which were replaced: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java#L108

We should define the behavior of validation in the context of branching.

Copy link
Contributor Author

@namrathamyske namrathamyske Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amogh-jahagirdar @rdblue All validations must start from tip of branch, aka the current snapshot in the branch. So all the operations which override the protected void validate(TableMetadata currentMetadata) { } aka - BaseOverwriteFiles, BaseReplacePartitions, BaseRewriteFiles, BaseRowDelta need to reference the current snapshot from the branch tip.

Is any other validation needed here ?

@namrathamyske
Copy link
Contributor Author

namrathamyske commented Jun 6, 2022

There is a more fundamental issue here for maintaining the correct snapshot lineage on a branch. The parent of the new snapshot needs to be set to the tip of the branch so that when the operation commits the lineage is correct.

Also there should be validation that the ref is actually a branch and not a tag since we cannot produce snapshots to a tag.

In apply https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L170

If the branch is main, the parent is the current snapshot

If the branch is not main, the parent is the tip of the branch which would be retrieved via table.ref(branch).snapshotId().

@amogh-jahagirdar Not only the parent ID, even the manifests need to merged by taking the parent tip into account.

In line

Snapshot current = base.currentSnapshot();

Making a change as following:

if(branchToUpdate != null){

  current = base.snapshot(base.ref(branchToUpdate).snapshotId());

}
else{

  current = base.currentSnapshot();

}

@namrathamyske namrathamyske requested a review from rdblue June 8, 2022 16:36
@namrathamyske namrathamyske force-pushed the master branch 2 times, most recently from e593035 to 3a1964d Compare June 9, 2022 20:20
Append changes to snapshot branch - changes after review comments

remove space

Adding support methods for supporting performing append/delete on branch

Removing spaces

Removing spaces

Removing unwanted vars

Adding check for null and invalid branch

Adding check for null and invalid branch
@namrathamyske
Copy link
Contributor Author

@jackye1995 Is it possible for you to review this?

cc: @SreeramGarlapati

@namrathamyske namrathamyske changed the title Core, API: Append changes to snapshot branch Core, API: Append/Delete changes to snapshot branch Jun 20, 2022
@amogh-jahagirdar
Copy link
Contributor

Replied here #4926 (comment) , but yes as @rdblue pointed out I think we'll need to think of defining conflict detection validation behavior in the presence of branching and tagging (such as the case where validation starts at a snapshot which is not an ancestor of the branch). I'll need to think more on this.

@namrathamyske
Copy link
Contributor Author

namrathamyske commented Jun 21, 2022

@amogh-jahagirdar @rdblue For operations like - BaseOverwriteFiles we call validationHistory method which is fetching ancestors between starting snapshot and current snapshot. After branching is introduced, this can cause inconsistencies. Is my understanding right or am I missing something wrt to validations?

For now if we just introduce branching for append and delete operations, is it going to be fine ? We can extend to overwrite later.

@namrathamyske namrathamyske changed the title Core, API: Append/Delete changes to snapshot branch Core, API: Performing Table Operations to a snapshot branch Jul 9, 2022
@namrathamyske namrathamyske changed the title Core, API: Performing Table Operations to a snapshot branch Core, API: Performing Table Operations to a snapshot branch ref Jul 9, 2022
@namrathamyske
Copy link
Contributor Author

namrathamyske commented Jul 9, 2022

@rdblue @amogh-jahagirdar can someone review this PR once again so that Implementation for child operations Append: #5010 , RowDelta #5181 can move forward ?

@namrathamyske namrathamyske changed the title Core, API: Performing Table Operations to a snapshot branch ref Core, API: Performing Operations on a snapshot branch ref Jul 9, 2022
Assert.assertEquals("Should set changed partition count", "2", changedPartitions);
}

@Test(expected = UnsupportedOperationException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test exceptions, we use either TestHelpers.assertThrows or Assertions. That's important so that you can add assertions about the error message, and so that you can run assertions to validate that state did not change. In this case, the the branch name should not exist and the snapshot list should be the same size.

.appendFile(FILE_A)
.commit();
table.manageSnapshots().createBranch("ref", table.currentSnapshot().snapshotId()).commit();
table.newFastAppend().appendFile(FILE_B).toBranch("ref").commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this implement branch commits for FastAppend? That way we can see one operation that works from end-to-end when reviewing the changes to SnapshotUpdate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue Have written the complete test-case in #5010.

Long parentSnapshotId = base.ref(targetBranch) != null ? base.ref(targetBranch).snapshotId() : null;
long sequenceNumber = base.nextSequenceNumber();

// run validations from the child operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to pass the parent snapshot ID into validate along with table metadata (base) so that validations can check changes between the starting snapshot ID and the parent snapshot, rather than between the starting snapshot ID and the current snapshot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing with apply.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue My thought process is to check the whether starting snapshot ID is an ancestor of parent snapshot. If not then throw invalid starting snapshot ID exception. Else we should go ahead with checking between starting snapshot ID & parent ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/namrathamyske/iceberg/pull/1/files should take care of what @rdblue is talking about here which if I'm not mistaken is more about a cleaner abstraction for applying a change to a specific snapshot, and running a validation to a specific snapshot which will be needed when we implement branch support for the other producers.

@amogh-jahagirdar
Copy link
Contributor

amogh-jahagirdar commented Aug 3, 2022

@namrathamyske Thanks for all the work on this! I've raised a PR https://github.com/namrathamyske/iceberg/pull/1/files to your fork, just addressing some of the comments @rdblue brought up, would love to get your thoughts on it. Also @rdblue would like to get your take as well.

namrathamyske added a commit to SreeramGarlapati/iceberg that referenced this pull request Aug 3, 2022
amogh-jahagirdar and others added 2 commits August 3, 2022 23:47
…mplementation, append manifest lists of parent snapshot to the new manifests

Add some more tests
Core: Pass in parent snapshot to apply and validate.
@amogh-jahagirdar
Copy link
Contributor

Thanks for the review on the PR @namrathamyske. @rdblue could we get a review on the updated PR when you get a chance? Thanks!

public void testAppendToExistingBranch() {
table.newFastAppend().appendFile(FILE_A).commit();
table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit();
table.newFastAppend().appendFile(FILE_B).toBranch("branch").commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized I just hardcoded branch throughout the tests, probably cleaner to just define a variable in the tests and use that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay.

base.currentSnapshot() != null ? base.currentSnapshot().snapshotId() : null;
long sequenceNumber = base.nextSequenceNumber();
Snapshot parentSnapshot = base.currentSnapshot();
if (targetBranch != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also considering targetBranch will always be set (we have it default to Main, we don't need this check).

Comment on lines +306 to +314
AssertHelpers.assertThrows(
"Cannot commit to branch someBranch: org.apache.iceberg.BaseOverwriteFiles does not support branch commits",
UnsupportedOperationException.class,
() ->
table
.newOverwrite()
.overwriteByRowFilter(and(equal("date", "2018-06-09"), lessThan("id", 20)))
.addFile(FILE_10_TO_14)
.toBranch("someBranch"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should fix this assertion so it actually checks the expected error message; I think the assertThrows method with 3 params just surfaces this message in case the test fails, it doesn't actually check the expected error message. Like what's being done in TestReplacePartitions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor because it is going to be replaced very soon.

*
* @param metadataToUpdate the base table metadata to apply changes to
* @param snapshot
* @param snapshot parent snapshot passed to child operations.
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Aug 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for adding the commenting. Just a nit, I think this should say something like "snapshot to apply the changes to". Child operations doesn't make too much sense to me imo.

I think we just need to say "Apply the changes with the given metadata to the snapshot".

*
* @param branch which is name of SnapshotRef of type branch.
*/
ThisT toBranch(String branch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a default that throws UnsupportedOperationException.

String.format(
"Cannot commit to branch %s: %s does not support branch commits",
branch, this.getClass().getName()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the interface, not here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, fixed

@rdblue
Copy link
Contributor

rdblue commented Aug 7, 2022

Looks like this is close, but we need to fix the toBranch definition and revapi checks.

@amogh-jahagirdar
Copy link
Contributor

@namrathamyske raised a PR to your fork to address the revapi and moving the default implementation up to the interface level. https://github.com/namrathamyske/iceberg/pull/3/files

amogh-jahagirdar and others added 3 commits August 7, 2022 11:22
@namrathamyske
Copy link
Contributor Author

@rdblue Merged PR with @amogh-jahagirdar's changes which had changes wrt to review comments.

@amogh-jahagirdar
Copy link
Contributor

Thanks @jackye1995 for triggering the approval workflow. @rdblue when you get a chance, let us know how the change looks!

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also looks good to me!

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @namrathamyske !

@jackye1995
Copy link
Contributor

Looks like we are all in agreement, I will merge this to unblock next step implementations. thanks for the contribution @namrathamyske @amogh-jahagirdar ! and thanks for the review @rdblue !

@jackye1995 jackye1995 merged commit 5a15efc into apache:master Aug 11, 2022
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants