Skip to content

Use batch deletes in IcebergMetadata#14908

Merged
findepi merged 3 commits intotrinodb:masterfrom
alexjo2144:iceberg/batch-delete-files
Nov 9, 2022
Merged

Use batch deletes in IcebergMetadata#14908
findepi merged 3 commits intotrinodb:masterfrom
alexjo2144:iceberg/batch-delete-files

Conversation

@alexjo2144
Copy link
Copy Markdown
Member

Description

@homar recently added an API to the FileSystem API to delete a batch of files, if the underlying filesystem supports it. Leverage that API in places where many files are deleted at once.

Non-technical explanation

Leverage delete APIs in a more efficient way, if available.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@alexjo2144 alexjo2144 added performance no-release-notes This pull request does not require release notes entry labels Nov 4, 2022
@alexjo2144 alexjo2144 force-pushed the iceberg/batch-delete-files branch from 3017ab6 to 2c87d6b Compare November 4, 2022 18:24
@alexjo2144 alexjo2144 requested a review from findepi November 4, 2022 18:24
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to make the multi-file deletion in batches of DELETE_BATCH_SIZE size.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it, but IMO it's not really necessary. In remove_orphan_files or expire_snapshots it's a good idea because we're loading the list of files to delete incrementally, and batching there prevents us from materializing the full list in memory. Here, we already have the list in memory, so we can rely on the file system to batch in sizes appropriate for the implementation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the way you did it here is not helpful. Actually I expected you will partition it on higher level -

fullyDeletedFiles.values().stream()
                        .flatMap(Collection::stream)
                        .map(CommitTaskData::getPath)

without collecting it to list first. But now, after I learnt that Lists.partition() is not lazy, I am not sure if this can be done...

Comment on lines 1908 to 1909
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit?
if the FS needs batches, why not let it do on its own?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with 2c87d6b#r1015558983 that the data list is already in memory (in some form)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep deleteBatch.add(location + "/" + fileName); and if (deleteBatch.size() >= DELETE_BATCH_ lines together

@alexjo2144 alexjo2144 force-pushed the iceberg/batch-delete-files branch from de82ed2 to ee7b6ed Compare November 7, 2022 19:12
@alexjo2144 alexjo2144 requested a review from findepi November 7, 2022 19:44
@findepi
Copy link
Copy Markdown
Member

findepi commented Nov 9, 2022

Build is green. Will squash and merge.

@findepi findepi force-pushed the iceberg/batch-delete-files branch from ee7b6ed to 241c2cf Compare November 9, 2022 14:27
@findepi findepi merged commit f5c5cbc into trinodb:master Nov 9, 2022
@github-actions github-actions bot added this to the 403 milestone Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed no-release-notes This pull request does not require release notes entry performance

Development

Successfully merging this pull request may close these issues.

4 participants