Skip to content

Allow executing optimize procedure for Iceberg v2 table#12351

Merged
ebyhr merged 1 commit intotrinodb:masterfrom
ebyhr:ebi/iceberg-optimize-v2
May 27, 2022
Merged

Allow executing optimize procedure for Iceberg v2 table#12351
ebyhr merged 1 commit intotrinodb:masterfrom
ebyhr:ebi/iceberg-optimize-v2

Conversation

@ebyhr
Copy link
Member

@ebyhr ebyhr commented May 12, 2022

Description

Allow optimizing v2 table if delete files don't exist in Iceberg

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Allow executing `optimize` procedure for v2 table. ({issue}`12351`)

@cla-bot cla-bot bot added the cla-signed label May 12, 2022
@findepi
Copy link
Member

findepi commented May 12, 2022

Allow optimizing v2 table if delete files don't exist in Iceberg

What would it take to extend this to support v2 table with deletes?

cc @alexjo2144

@ebyhr
Copy link
Member Author

ebyhr commented May 13, 2022

I tried committing delete-files beforehand like this, but it still failed "Cannot commit, found new delete for replaced data file" error.

        DeleteFiles deleteFiles = transaction.newDelete();
        for (DeleteFile deletedFile : deletedFiles) {
            deleteFiles.deleteFile(deletedFile.path());
        }
        deleteFiles.commit();

        RewriteFiles rewriteFiles = transaction.newRewrite();
...

@findepi
Copy link
Member

findepi commented May 13, 2022

@ebyhr how do we know which delete files we can delete?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please write similar test or modify this in such a way that deletion is not the last operation ? So delete files are not part of the current snapshot?

@alexjo2144
Copy link
Member

alexjo2144 commented May 13, 2022

I tried committing delete-files beforehand like this, but it still failed "Cannot commit, found new delete for replaced data file" error.
...

I think you want the RewriteFiles#rewriteFiles method that takes 4 args.

RewriteFiles rewriteFiles = transaction.newRewrite();
rewriteFiles.rewriteFiles(scannedFiles, scannedDeleteFiles, newFiles, newDeleteFiles);
rewriteFiles.validateFromSnapshot(readSnapshotId).
rewriteFiles.commit();

@alexjo2144
Copy link
Member

alexjo2144 commented May 13, 2022

@ebyhr how do we know which delete files we can delete?

This could be tricky since delete-files may have deletes for multiple data-files. Some of which may be rewritten by optimize and some of which might not.

We skip scanning files during optimize if they're bigger than a given size, but we could change that to scan them files if they are small or have any delete files. That way we ensure that when the procedure completes all delete files can be removed.

@findepi
Copy link
Member

findepi commented May 16, 2022

we could change that to scan them files if they are small or have any delete files

yes

That way we ensure that when the procedure completes all delete files can be removed.

no, because there may be additional filters

This could be tricky since delete-files may have deletes for multiple data-files.

yes, tricky. What would happen if we don't delete any deletion files? Will Iceberg library prune them?

@alexjo2144
Copy link
Member

no, because there may be additional filters

Right now alter table execute filters have to describe a partition, they can't be more specific than that. So we should be able to remove any delete files in that partition.

yes, tricky. What would happen if we don't delete any deletion files? Will Iceberg library prune them?

You won't be able to commit a snapshot in this state. The validations ensure that the data-files a delete-file references exist in the current snapshot.

@findepi
Copy link
Member

findepi commented May 16, 2022

The validations ensure that the data-files a delete-file references exist in the current snapshot.

Okay, but that requires reading all manifest files.
If we do this, we can as well find out the deletion files to forget about. Basically same logic as the validation, just a different conclusion.

@ebyhr ebyhr force-pushed the ebi/iceberg-optimize-v2 branch from 006e40e to c17aebd Compare May 24, 2022 09:40
@alexjo2144
Copy link
Member

You won't be able to commit a snapshot in this state. The validations ensure that the data-files a delete-file references exist in the current snapshot.

I was wrong about the validations that run here, you can commit a table in this state, it just has a dangling delete-file that is never used.

@findepi
Copy link
Member

findepi commented May 25, 2022

Test failure is related

TestIcebergSparkCompatibility > testOptimizeFailsOnV2IcebergTable [groups: profile_specific_tests, iceberg]
Expected callback to throw QueryExecutionException

@ebyhr ebyhr changed the title Allow optimizing v2 table if delete files don't exist in Iceberg Allow executing optimize procedure for Iceberg v2 table May 26, 2022
@ebyhr ebyhr force-pushed the ebi/iceberg-optimize-v2 branch from bd86253 to 525e071 Compare May 26, 2022 13:06
@ebyhr
Copy link
Member Author

ebyhr commented May 26, 2022

@findepi @homar Updated.

@ebyhr ebyhr force-pushed the ebi/iceberg-optimize-v2 branch 2 times, most recently from 0d0ce0e to ab6fd14 Compare May 27, 2022 01:28
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep, just change to 3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the condition, just remove this comment.
We don't know what Iceberg v3 brings us.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// org.apache.iceberg.Table.snapshot method returns null if there is no matching snapshot
// Table.snapshot method returns null if there is no matching snapshot

@ebyhr ebyhr force-pushed the ebi/iceberg-optimize-v2 branch from ab6fd14 to 98e631d Compare May 27, 2022 09:25
@ebyhr ebyhr merged commit 11a148f into trinodb:master May 27, 2022
@ebyhr ebyhr deleted the ebi/iceberg-optimize-v2 branch May 27, 2022 12:13
@github-actions github-actions bot added this to the 383 milestone May 27, 2022
@ebyhr ebyhr mentioned this pull request May 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants