Skip to content

Conversation

@openinx
Copy link
Member

@openinx openinx commented Mar 4, 2021

In iceberg format v2, we've introduced the row-level delete feature, which would introduce many delete files ( include positional delete files and insert data files). Currently the batch RewriteAction only works for iceberg format v2, say compacting data files only. Before expose the format v2 to end users, we need to check whether the common features are OK, we should provide the ability to rewrite both data files and delete files.

In this PR, I'm trying to extend the RewriteFiles API to work for both data files and delete files, so that we could implement the format v2's RewriteFiles Action.

@chenjunjiedada
Copy link
Collaborator

I wrote an API in #2216 you may be interested.

@openinx
Copy link
Member Author

openinx commented Mar 4, 2021

I wrote an API in #2216 you may be interested.

Thanks for the great work, I'd like to review that patch.

@openinx
Copy link
Member Author

openinx commented Mar 8, 2021

Ping @rdblue @aokolnychyi @chenjunjiedada for reviewing, extending the RewriteFiles API is the first step for further format v2's rewrite actions (See the linked issues), thanks.

* @return this for method chaining
*/
RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding RewriteFiles rewriteDeletes(Set<DeleteFile> fileToDelete, Set<DeleteFile> filesToAdd for rewrite deletes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to expose this API for replacing equality deletions with positional deletions, for me that seems like an internal usage we may don't have to define such a specific API for end users. I prefer to expose the following common API to end users, for our internal usage we could rewrite files based on that one.

public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
checkFilesToDelete(dataFilesToDelete, deleteFilesToDelete);
checkFilesToAdd(dataFilesToAdd, deleteFilesToAdd);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The files to add seems could be empty since we have introduced the row-level deletes. For example, we insert <1, 'aaa'>. then delete <1, 'aaa'>, finally do the rewrite files action. In the end, there should be no file to add when rewriting.

Copy link
Contributor

@yyanyy yyanyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, mostly minor comments


Assert.assertEquals("Should contain 3 manifest", 3, pending.allManifests().size());
Assert.assertFalse("Should not contain manifest from initial write",
pending.allManifests().containsAll(initialManifests));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to check if any manifest matches initial manifests by .allManifests.stream().anyMatch(initialManifests::contains) ?

TableMetadata metadata = readMetadata();
List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
Assert.assertTrue("Should committed the manifests",
metadata.currentSnapshot().allManifests().containsAll(committedManifests));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could do metadata.currentSnapshot().allManifests().equals(committedManifests)

ManifestFile manifest3 = pending.allManifests().get(2);

validateManifestEntries(manifest1,
ids(baseSnapshotId),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we verify that sequence number for this is 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd better to add the seq number verification after we have reached agreement on this issue: #2308.

metadata = readMetadata();
List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
Assert.assertTrue("Should committed the manifests",
metadata.currentSnapshot().allManifests().containsAll(committedManifests));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also equals here

if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
// When there is no delete files in the rewrite action, data files to add cannot be null or empty.
Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
"Data files to add can not be null or empty because there's no delete file to rewrite");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: might want to update to something like "Data files to add can not be null or empty when there is no delete file to be rewritten", similar applies to L61

}

/**
* Add a rewrite that replaces one set of files with another set that contains the same data (format v2).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably don't need to mention format v2 specifically

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed !

Comment on lines -45 to -47
Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
"Files to delete cannot be null or empty");
Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are changing the logic now to not enforce input sets to be non-nullable? I think for the new code we can do a precondition check on the four input sets to ensure they are all non-null, to save all the null check everywhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are changing the logic now to not enforce input sets to be non-nullable?

Yes, in current version we are required to pass non-empty and non-null Set because we must ensure the input files and output files have some rows to rewrite & generate. After introducing the format v2, the data input files can be empty (if we plan to convert eq-deletes to pos-deletes), then I think we also don't have to require the user to pass a non-null empty set. Here passing a null or ImmutableMap.of() , both of them looks good to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they can be empty, I guess my major point is that since it is the new API we introduced here that allows empty input, we can enforce inputs to be not null by adding a precondition check at the beginning of the method to fail if null is passed in, so that we don't have to do all the != null everywhere to make the code a bit more clean, while still allowing ImmutableSet.of() which should logically be equivalent to null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think it's good to simplify this null check, just updated this patch.

@openinx
Copy link
Member Author

openinx commented Mar 26, 2021

Filed an issue to address the flakey unit test #2384, will retry the travis CI.

@openinx openinx closed this Mar 26, 2021
@openinx openinx reopened this Mar 26, 2021
@openinx
Copy link
Member Author

openinx commented Mar 29, 2021

Got this merged, thanks @yyanyy and @chenjunjiedada for reviewing.

@openinx openinx merged commit 7a0954f into apache:master Mar 29, 2021
@openinx openinx deleted the rewrite-v2 branch March 29, 2021 03:11
coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021
stevenzwu pushed a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an action to rewrite data files and remove deleted rows

3 participants