Support performing merge appends and delete files on branches #5618
Force-pushed from b6fbc7d to b5b5407.
Assert.assertEquals("Should have 1 manifest", 1, deleteSnapshot.allManifests(FILE_IO).size());
validateManifestEntries(
    deleteSnapshot.allManifests(FILE_IO).get(0),
    ids(initialSnapshot.snapshotId(), deleteSnapshot.snapshotId()),
In the test cases, initialSnapshot is always behind deleteSnapshot. Should there be a check for the case where it's ahead of deleteSnapshot? Not sure if it's necessary.
Sorry, I'm not sure what you mean by "ahead" and "behind". Do you mean that we should validate the state of main after the branch writes?
In that case, it makes sense, but I don't think it's particularly necessary. The existing tests already verify that we commit snapshots to the correct branch, so they guarantee that the operations are performed on the correct lineage and that other lineages are left untouched. That being said, there's no harm in adding more validations! @jackye1995 @rdblue any thoughts?
Yes, I agree with what Amogh says, but we can always add a test to verify the behavior.
I think these tests are specific enough that we don't need to validate the order of commits. Entries with Status.DELETED will only appear in a single manifest and are dropped if the manifest is rewritten. So validating DELETE entries can only succeed if the delete operation produced the manifest.
}

@Test
public void testDeleteWithRowFilterWithCombinedPredicatesOnBranch() {
Similar to the PR for RowDelta, can we parameterize this so that we test all of the existing cases on a branch that is independent from main? Then we only need tests for modifications to both main and a separate branch (to validate that they evolve independently).
We'll also need to update snapshot expiration tests to ensure that file deletion happens as expected.
Here's the comment on RowDelta: https://github.com/apache/iceberg/pull/5234/files#r953201851
I've parameterized the tests. It does seem that, for meaningful tests to be written here, we also need an implementation for MergeAppend, so I added that.
Force-pushed from 7ac2d2f to fe82186.
The tests for delete files depend on merge appends for some test cases, and vice versa. So to have meaningful branch write tests, I've added the implementation for merge append as well and parameterized all the test cases so that the existing cases run for both main and a branch. I still need to follow up on the updated expiration tests and then on tests that verify that different branches evolve independently.
  public static Object[] parameters() {
-   return new Object[] {1, 2};
+   return new Object[][] {
+     new Object[] {1, "main"},
I am a little confused regarding the parameters. Does each {1, "testBranch"} translate to {formatVersion, branchName}?
That's correct, we test for every combination of format version + branch. I'll need to update the "Parameters" annotation so that we can get nicer test names. The branch parameter gets injected as an argument in the test class constructor.
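For illustration, the cross product of format versions and branches, together with a readable test name, could be sketched in plain Java as follows. This is a minimal sketch, not the code in this PR: the class name `BranchParameters`, the `testName` helper, and the exact name pattern are assumptions. In a JUnit 4 test the pattern would go into `@Parameterized.Parameters(name = "...")`, which fills `{0}` and `{1}` from each parameter row.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: builds {formatVersion, branch} rows.
final class BranchParameters {
  static final int[] FORMAT_VERSIONS = {1, 2};
  // "testBranch" is the branch name mentioned in the discussion above.
  static final String[] BRANCHES = {"main", "testBranch"};

  // Assumed name pattern; JUnit 4 would take it via
  // @Parameterized.Parameters(name = "formatVersion = {0}, branch = {1}").
  static final String NAME_PATTERN = "formatVersion = {0}, branch = {1}";

  // One row per combination; a parameterized runner would pass each row
  // to the test class constructor as (formatVersion, branch).
  static Object[][] parameters() {
    List<Object[]> rows = new ArrayList<>();
    for (int version : FORMAT_VERSIONS) {
      for (String branch : BRANCHES) {
        rows.add(new Object[] {version, branch});
      }
    }
    return rows.toArray(new Object[0][]);
  }

  // Renders the display name for one row using the pattern above.
  static String testName(Object[] row) {
    return MessageFormat.format(NAME_PATTERN, row);
  }
}
```

With two format versions and two branches this yields four rows, so every existing test case runs once per combination.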
Snapshot commit(Table tbl, SnapshotUpdate snapshotUpdate, String branch) {
  Snapshot snapshot = null;
  if (branch.equals(SnapshotRef.MAIN_BRANCH)) {
    snapshotUpdate.commit();
    snapshot = tbl.currentSnapshot();
  } else {
    ((SnapshotProducer) snapshotUpdate.toBranch(branch)).commit();
    snapshot = tbl.snapshot(tbl.refs().get(branch).snapshotId());
  }
  return snapshot;
}

Snapshot apply(SnapshotUpdate snapshotUpdate, String branch) {
  if (branch.equals(SnapshotRef.MAIN_BRANCH)) {
    return ((SnapshotProducer) snapshotUpdate).apply();
  } else {
    return ((SnapshotProducer) snapshotUpdate.toBranch(branch)).apply();
  }
}
I could combine these into a common method apply(Table tl, SnapshotUpdate snapshotUpdate, String branch, boolean commit) and then only commit if commit is true. For now, though, I'm opting to keep them explicit and separate, unless there are any objections.
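A minimal sketch of what such a combined helper might look like, for comparison with the two explicit methods. Everything here is a stand-in: `Update`, `FakeTable`, and `FakeUpdate` are hypothetical types written only for this example and are not the Iceberg `SnapshotUpdate`, `SnapshotProducer`, or `Table` classes; snapshots are reduced to string ids.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are Iceberg classes.
final class BranchCommitSketch {
  static final String MAIN_BRANCH = "main";

  /** Stand-in for a snapshot-producing operation. */
  interface Update {
    Update toBranch(String branch); // retarget the operation at a branch
    String apply();                 // id of the snapshot that would be produced
    void commit();                  // make that snapshot the branch head
  }

  /** Stand-in table that only tracks the head snapshot id of each branch. */
  static final class FakeTable {
    final Map<String, String> heads = new HashMap<>();
    private int nextId = 1;

    Update newUpdate() {
      return new FakeUpdate(this);
    }
  }

  static final class FakeUpdate implements Update {
    private final FakeTable table;
    private String branch = MAIN_BRANCH;
    private String pending;

    FakeUpdate(FakeTable table) {
      this.table = table;
    }

    @Override
    public Update toBranch(String target) {
      this.branch = target;
      return this;
    }

    @Override
    public String apply() {
      if (pending == null) {
        pending = "snap-" + table.nextId++;
      }
      return pending;
    }

    @Override
    public void commit() {
      table.heads.put(branch, apply());
    }
  }

  // Combined helper: always targets the requested branch, commits only on request.
  static String apply(FakeTable tbl, Update update, String branch, boolean commit) {
    Update target = MAIN_BRANCH.equals(branch) ? update : update.toBranch(branch);
    if (!commit) {
      return target.apply();
    }
    target.commit();
    return tbl.heads.get(branch);
  }
}
```

The boolean flag saves one method but makes call sites harder to read, for example `apply(tbl, update, branch, true)`, which is one argument for keeping the two helpers separate.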
Force-pushed from 8e9eda7 to 3507e4f.
Force-pushed from a728c40 to 2f828a4.
    snapshotUpdate.commit();
    snapshot = tbl.currentSnapshot();
  } else {
    ((SnapshotProducer) snapshotUpdate.toBranch(branch)).commit();
Why not just pass a SnapshotProducer to this method? That would eliminate the casting, right?
AppendFiles doesn't extend SnapshotProducer, so the casting would just have to shift to the caller.
  }
}

Snapshot currentSnapshot(Table tbl, String branch) {
How about latest rather than current? I think that describes it better since current has a specific meaning (the latest on the main branch).
updated to latest
rdblue left a comment
@amogh-jahagirdar, overall this is really close. I think that all the changes are correct and it does exactly the right thing to test branch commits by running all of the existing tests on both main and another branch. Nice work!
There are a few minor things to fix, but we should be able to get this in pretty quickly!
Force-pushed from 261722a to f8928d8.
Force-pushed from f8928d8 to 7469f23.
Since @rdblue also approved I will go ahead and merge this, thanks for the contribution @amogh-jahagirdar! And thanks for the review @rdblue!
Thanks for the review @rdblue @jackye1995 @namrathamyske!
While implementing some of the snapshot-producing operations for branches, I realized that one operation that could be handled in a straightforward manner was delete files. It also helps with writing some of the test cases for the branch operations, because some of those tests need to delete files on a given branch. So I'm separating this out into its own PR.
cc: @rdblue @jackye1995 @namrathamyske