Skip to content
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 23 additions & 3 deletions api/src/main/java/org/apache/iceberg/RewriteFiles.java
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

import java.util.Set;
import org.apache.iceberg.exceptions.ValidationException;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;

/**
* API for replacing files in a table.
Expand All @@ -35,11 +36,30 @@
*/
public interface RewriteFiles extends SnapshotUpdate<RewriteFiles> {
/**
* Add a rewrite that replaces one set of files with another set that contains the same data.
* Add a rewrite that replaces one set of data files with another set that contains the same data.
*
* @param filesToDelete files that will be replaced (deleted), cannot be null or empty.
* @param filesToAdd files that will be added, cannot be null or empty.
* @param filesToAdd files that will be added, cannot be null or empty.
* @return this for method chaining
*/
RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding RewriteFiles rewriteDeletes(Set<DeleteFile> fileToDelete, Set<DeleteFile> filesToAdd for rewrite deletes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to expose this API for replacing equality deletions with positional deletions, for me that seems like an internal usage we may don't have to define such a specific API for end users. I prefer to expose the following common API to end users, for our internal usage we could rewrite files based on that one.

return rewriteFiles(
filesToDelete,
ImmutableSet.of(),
filesToAdd,
ImmutableSet.of()
);
}

/**
* Add a rewrite that replaces one set of files with another set that contains the same data.
*
* @param dataFilesToDelete data files that will be replaced (deleted).
* @param deleteFilesToDelete delete files that will be replaced (deleted).
* @param dataFilesToAdd data files that will be added.
* @param deleteFilesToAdd delete files that will be added.
* @return this for method chaining.
*/
RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
}
56 changes: 46 additions & 10 deletions core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
Original file line number Diff line number Diff line change
Expand Up @@ -40,19 +40,55 @@ protected String operation() {
return DataOperations.REPLACE;
}

private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
int filesToDelete = 0;
if (dataFilesToDelete != null) {
filesToDelete += dataFilesToDelete.size();
}

if (deleteFilesToDelete != null) {
filesToDelete += deleteFilesToDelete.size();
}

Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");

if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
// When there is no delete files in the rewrite action, data files to add cannot be null or empty.
Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
"Data files to add can not be null or empty because there's no delete file to be rewritten");
Preconditions.checkArgument(deleteFilesToAdd == null || deleteFilesToAdd.isEmpty(),
"Delete files to add must be null or empty because there's no delete file to be rewritten");
}
}

@Override
public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
"Files to delete cannot be null or empty");
Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),
Comment on lines -45 to -47
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are changing the logic now to not enforce input sets to be non-nullable? I think for the new code we can do a precondition check on the four input sets to ensure they are all non-null, to save all the null check everywhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are changing the logic now to not enforce input sets to be non-nullable?

Yes, in current version we are required to pass non-empty and non-null Set because we must ensure the input files and output files have some rows to rewrite & generate. After introducing the format v2, the data input files can be empty (if we plan to convert eq-deletes to pos-deletes), then I think we also don't have to require the user to pass a non-null empty set. Here passing a null or ImmutableMap.of() , both of them looks good to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they can be empty, I guess my major point is that since it is the new API we introduced here that allows empty input, we can enforce inputs to be not null by adding a precondition check at the beginning of the method to fail if null is passed in, so that we don't have to do all the != null everywhere to make the code a bit more clean, while still allowing ImmutableSet.of() which should logically be equivalent to null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think it's good to simplify this null check, just updated this patch.

"Files to add can not be null or empty");

for (DataFile toDelete : filesToDelete) {
delete(toDelete);
public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
verifyInputAndOutputFiles(dataFilesToDelete, deleteFilesToDelete, dataFilesToAdd, deleteFilesToAdd);

if (dataFilesToDelete != null) {
for (DataFile dataFile : dataFilesToDelete) {
delete(dataFile);
}
}

if (deleteFilesToDelete != null) {
for (DeleteFile deleteFile : deleteFilesToDelete) {
delete(deleteFile);
}
}

if (dataFilesToAdd != null) {
for (DataFile dataFile : dataFilesToAdd) {
add(dataFile);
}
}

for (DataFile toAdd : filesToAdd) {
add(toAdd);
if (deleteFilesToAdd != null) {
for (DeleteFile deleteFile : deleteFilesToAdd) {
add(deleteFile);
}
}

return this;
Expand Down
8 changes: 8 additions & 0 deletions core/src/test/java/org/apache/iceberg/TableTestBase.java
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,14 @@ public class TableTestBase {
.withPartitionPath("data_bucket=0") // easy way to set partition data for now
.withRecordCount(1)
.build();
// Equality delete files.
static final DeleteFile FILE_A2_DELETES = FileMetadata.deleteFileBuilder(SPEC)
.ofEqualityDeletes(3)
.withPath("/path/to/data-a2-deletes.parquet")
.withFileSizeInBytes(10)
.withPartitionPath("data_bucket=0")
.withRecordCount(1)
.build();
static final DataFile FILE_B = DataFiles.builder(SPEC)
.withPath("/path/to/data-b.parquet")
.withFileSizeInBytes(10)
Expand Down
Loading