-
Notifications
You must be signed in to change notification settings - Fork 3k
Core, API: Performing operations on a snapshot branch ref #4926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for contributing this!
|
@amogh-jahagirdar Updated all minor changes. Have a question regarding writing the unit test for this. I would need a way to test the refs from table. How can i proceed with that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a more fundamental issue here for maintaining the correct snapshot lineage on a branch. The parent of the new snapshot needs to be set to the tip of the branch so that when the operation commits the lineage is correct.
Also there should be validation that the ref is actually a branch and not a tag since we cannot produce snapshots to a tag.
If the branch is main, the parent is the current snapshot
If the branch is not main, the parent is the tip of the branch which would be retrieved via table.ref(branch).snapshotId().
| ThisT scanManifestsWith(ExecutorService executorService); | ||
|
|
||
| /** | ||
| * Perform operations a particular branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a @param to document branch. I think we also need to specify behavior:
- What happens if the branch does not exist? I think we would create one from the
mainbranch. - How does this change validation? What if the validation's from snapshot is not an ancestor of the branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@namrathamyske Each SnapshotProducer can implement their own validation logic https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L156
This gets called when trying to apply the operation as part of validating conflicts. For example rewriting data files needs to validate that there are no new row level deletes for data files which were replaced: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java#L108
We should define the behavior of validation in the context of branching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amogh-jahagirdar @rdblue All validations must start from tip of branch, aka the current snapshot in the branch. So all the operations which override the protected void validate(TableMetadata currentMetadata) { } aka - BaseOverwriteFiles, BaseReplacePartitions, BaseRewriteFiles, BaseRowDelta need to reference the current snapshot from the branch tip.
Is any other validation needed here ?
@amogh-jahagirdar Not only the parent ID, even the manifests need to merged by taking the parent tip into account. In line
Making a change as following: |
e593035 to
3a1964d
Compare
Append changes to snapshot branch - changes after review comments remove space Adding support methods for supporting performing append/delete on branch Removing spaces Removing spaces Removing unwanted vars Adding check for null and invalid branch Adding check for null and invalid branch
|
@jackye1995 Is it possible for you to review this? |
|
Replied here #4926 (comment) , but yes as @rdblue pointed out I think we'll need to think of defining conflict detection validation behavior in the presence of branching and tagging (such as the case where validation starts at a snapshot which is not an ancestor of the branch). I'll need to think more on this. |
|
@amogh-jahagirdar @rdblue For operations like - BaseOverwriteFiles we call For now if we just introduce branching for append and delete operations, is it going to be fine ? We can extend to overwrite later. |
|
@rdblue @amogh-jahagirdar can someone review this PR once again so that Implementation for child operations Append: #5010 , RowDelta #5181 can move forward ? |
| Assert.assertEquals("Should set changed partition count", "2", changedPartitions); | ||
| } | ||
|
|
||
| @Test(expected = UnsupportedOperationException.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test exceptions, we use either TestHelpers.assertThrows or Assertions. That's important so that you can add assertions about the error message, and so that you can run assertions to validate that state did not change. In this case, the the branch name should not exist and the snapshot list should be the same size.
| .appendFile(FILE_A) | ||
| .commit(); | ||
| table.manageSnapshots().createBranch("ref", table.currentSnapshot().snapshotId()).commit(); | ||
| table.newFastAppend().appendFile(FILE_B).toBranch("ref").commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this implement branch commits for FastAppend? That way we can see one operation that works from end-to-end when reviewing the changes to SnapshotUpdate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Long parentSnapshotId = base.ref(targetBranch) != null ? base.ref(targetBranch).snapshotId() : null; | ||
| long sequenceNumber = base.nextSequenceNumber(); | ||
|
|
||
| // run validations from the child operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to pass the parent snapshot ID into validate along with table metadata (base) so that validations can check changes between the starting snapshot ID and the parent snapshot, rather than between the starting snapshot ID and the current snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing with apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue My thought process is to check the whether starting snapshot ID is an ancestor of parent snapshot. If not then throw invalid starting snapshot ID exception. Else we should go ahead with checking between starting snapshot ID & parent ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/namrathamyske/iceberg/pull/1/files should take care of what @rdblue is talking about here which if I'm not mistaken is more about a cleaner abstraction for applying a change to a specific snapshot, and running a validation to a specific snapshot which will be needed when we implement branch support for the other producers.
|
@namrathamyske Thanks for all the work on this! I've raised a PR https://github.com/namrathamyske/iceberg/pull/1/files to your fork, just addressing some of the comments @rdblue brought up, would love to get your thoughts on it. Also @rdblue would like to get your take as well. |
master sync - PR apache#4926
…mplementation, append manifest lists of parent snapshot to the new manifests Add some more tests
Core: Pass in parent snapshot to apply and validate.
|
Thanks for the review on the PR @namrathamyske. @rdblue could we get a review on the updated PR when you get a chance? Thanks! |
| public void testAppendToExistingBranch() { | ||
| table.newFastAppend().appendFile(FILE_A).commit(); | ||
| table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit(); | ||
| table.newFastAppend().appendFile(FILE_B).toBranch("branch").commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized I just hardcoded branch throughout the tests, probably cleaner to just define a variable in the tests and use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay.
| base.currentSnapshot() != null ? base.currentSnapshot().snapshotId() : null; | ||
| long sequenceNumber = base.nextSequenceNumber(); | ||
| Snapshot parentSnapshot = base.currentSnapshot(); | ||
| if (targetBranch != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also considering targetBranch will always be set (we have it default to Main, we don't need this check).
| AssertHelpers.assertThrows( | ||
| "Cannot commit to branch someBranch: org.apache.iceberg.BaseOverwriteFiles does not support branch commits", | ||
| UnsupportedOperationException.class, | ||
| () -> | ||
| table | ||
| .newOverwrite() | ||
| .overwriteByRowFilter(and(equal("date", "2018-06-09"), lessThan("id", 20))) | ||
| .addFile(FILE_10_TO_14) | ||
| .toBranch("someBranch")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should fix this assertion so it actually checks the expected error message; I think the assertThrows method with 3 params just surfaces this message in case the test fails, it doesn't actually check the expected error message. Like what's being done in TestReplacePartitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is minor because it is going to be replaced very soon.
| * | ||
| * @param metadataToUpdate the base table metadata to apply changes to | ||
| * @param snapshot | ||
| * @param snapshot parent snapshot passed to child operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for adding the commenting. Just a nit, I think this should say something like "snapshot to apply the changes to". Child operations doesn't make too much sense to me imo.
I think we just need to say "Apply the changes with the given metadata to the snapshot".
| * | ||
| * @param branch which is name of SnapshotRef of type branch. | ||
| */ | ||
| ThisT toBranch(String branch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a default that throws UnsupportedOperationException.
| String.format( | ||
| "Cannot commit to branch %s: %s does not support branch commits", | ||
| branch, this.getClass().getName())); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the interface, not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, fixed
|
Looks like this is close, but we need to fix the |
|
@namrathamyske raised a PR to your fork to address the revapi and moving the default implementation up to the interface level. https://github.com/namrathamyske/iceberg/pull/3/files |
…s unsupported. Allow new API in revapi
Core, API: Default implementation of toBranch on SnapshotUpdate throws unsupported.
|
@rdblue Merged PR with @amogh-jahagirdar's changes which had changes wrt to review comments. |
|
Thanks @jackye1995 for triggering the approval workflow. @rdblue when you get a chance, let us know how the change looks! |
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also looks good to me!
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @namrathamyske !
|
Looks like we are all in agreement, I will merge this to unblock next step implementations. thanks for the contribution @namrathamyske @amogh-jahagirdar ! and thanks for the review @rdblue ! |
This PR is related to the open issue #3896 to perform table operations on a branch ref. Right now throws UnsupportedOperationException. In laters PR's will include implementation for each operation.