-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: Support rewriting delete files. #2294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I wrote an API in #2216 you may be interested. |
Thanks for the great work, I'd like to review that patch. |
|
Ping @rdblue @aokolnychyi @chenjunjiedada for reviewing, extending the RewriteFiles API is the first step for further format v2's rewrite actions (See the linked issues), thanks. |
| * @return this for method chaining | ||
| */ | ||
| RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd); | ||
| default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding RewriteFiles rewriteDeletes(Set<DeleteFile> fileToDelete, Set<DeleteFile> filesToAdd for rewrite deletes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to expose this API for replacing equality deletions with positional deletions, for me that seems like an internal usage we may don't have to define such a specific API for end users. I prefer to expose the following common API to end users, for our internal usage we could rewrite files based on that one.
| public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete, | ||
| Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) { | ||
| checkFilesToDelete(dataFilesToDelete, deleteFilesToDelete); | ||
| checkFilesToAdd(dataFilesToAdd, deleteFilesToAdd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The files to add seems could be empty since we have introduced the row-level deletes. For example, we insert <1, 'aaa'>. then delete <1, 'aaa'>, finally do the rewrite files action. In the end, there should be no file to add when rewriting.
yyanyy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, mostly minor comments
|
|
||
| Assert.assertEquals("Should contain 3 manifest", 3, pending.allManifests().size()); | ||
| Assert.assertFalse("Should not contain manifest from initial write", | ||
| pending.allManifests().containsAll(initialManifests)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to check if any manifest matches initial manifests by .allManifests.stream().anyMatch(initialManifests::contains) ?
| TableMetadata metadata = readMetadata(); | ||
| List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3); | ||
| Assert.assertTrue("Should committed the manifests", | ||
| metadata.currentSnapshot().allManifests().containsAll(committedManifests)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could do metadata.currentSnapshot().allManifests().equals(committedManifests)
| ManifestFile manifest3 = pending.allManifests().get(2); | ||
|
|
||
| validateManifestEntries(manifest1, | ||
| ids(baseSnapshotId), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we verify that sequence number for this is 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd better to add the seq number verification after we have reached agreement on this issue: #2308.
| metadata = readMetadata(); | ||
| List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3); | ||
| Assert.assertTrue("Should committed the manifests", | ||
| metadata.currentSnapshot().allManifests().containsAll(committedManifests)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: also equals here
| if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) { | ||
| // When there is no delete files in the rewrite action, data files to add cannot be null or empty. | ||
| Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0, | ||
| "Data files to add can not be null or empty because there's no delete file to rewrite"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might want to update to something like "Data files to add can not be null or empty when there is no delete file to be rewritten", similar applies to L61
| } | ||
|
|
||
| /** | ||
| * Add a rewrite that replaces one set of files with another set that contains the same data (format v2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably don't need to mention format v2 specifically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed !
| Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(), | ||
| "Files to delete cannot be null or empty"); | ||
| Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are changing the logic now to not enforce input sets to be non-nullable? I think for the new code we can do a precondition check on the four input sets to ensure they are all non-null, to save all the null check everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are changing the logic now to not enforce input sets to be non-nullable?
Yes, in current version we are required to pass non-empty and non-null Set because we must ensure the input files and output files have some rows to rewrite & generate. After introducing the format v2, the data input files can be empty (if we plan to convert eq-deletes to pos-deletes), then I think we also don't have to require the user to pass a non-null empty set. Here passing a null or ImmutableMap.of() , both of them looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that they can be empty, I guess my major point is that since it is the new API we introduced here that allows empty input, we can enforce inputs to be not null by adding a precondition check at the beginning of the method to fail if null is passed in, so that we don't have to do all the != null everywhere to make the code a bit more clean, while still allowing ImmutableSet.of() which should logically be equivalent to null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think it's good to simplify this null check, just updated this patch.
|
Filed an issue to address the flakey unit test #2384, will retry the travis CI. |
|
Got this merged, thanks @yyanyy and @chenjunjiedada for reviewing. |
In iceberg format v2, we've introduced the row-level delete feature, which would introduce many delete files ( include positional delete files and insert data files). Currently the batch RewriteAction only works for iceberg format v2, say compacting data files only. Before expose the format v2 to end users, we need to check whether the common features are OK, we should provide the ability to rewrite both data files and delete files.
In this PR, I'm trying to extend the RewriteFiles API to work for both data files and delete files, so that we could implement the format v2's RewriteFiles Action.