
Conversation

@dungdm93 (Contributor) commented Jul 28, 2022

Use batch delete (a.k.a. bulk delete) to delete multiple files when the FileIO implements SupportsBulkOperations.
#4012

Signed-off-by: Đặng Minh Dũng <[email protected]>
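For context, a minimal sketch of the dispatch this PR proposes, using Iceberg's FileIO, SupportsBulkOperations, and Tasks APIs; the wrapper class and its method are illustrative, not the PR's actual FileIOUtil code:

    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.SupportsBulkOperations;
    import org.apache.iceberg.util.Tasks;

    class BulkDeleteSketch {
      // Hypothetical helper: bulk-delete when the FileIO supports it,
      // otherwise fall back to per-file deletes via the Tasks framework.
      static void deleteFiles(FileIO io, List<String> files) {
        if (io instanceof SupportsBulkOperations) {
          // One call; the implementation (e.g. S3FileIO) batches internally.
          ((SupportsBulkOperations) io).deleteFiles(files);
        } else {
          Tasks.foreach(files)
              .noRetry()
              .suppressFailureWhenFinished()
              .run(io::deleteFile);
        }
      }
    }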
@szehon-ho (Member) left a comment

I think this is a good direction. I'd also like @amogh-jahagirdar to take a look to see whether the FileIOUtil makes sense here or not.

    .noRetry()
    .executeWith(service)
    .suppressFailureWhenFinished()
    .onFailure((file, exc) -> LOG.warn("Delete failed for {}: {}", name, file, exc))
@szehon-ho: Are we missing a parameter?

@szehon-ho: Never mind about this.

    if (io instanceof SupportsBulkOperations) {
      try {
        SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
        bulkIO.deleteFiles(files);
@szehon-ho: Should we use the configured 'service' pool?


    private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
    -   Tasks.foreach(manifests)
    +   FileIOUtil.bulkDeleteManifests(io, manifests)
@szehon-ho: Looks like we don't pass a name here. Thinking about it, if a name is always recommended, why not make it a mandatory parameter of the bulkDelete API?

@dramaticlly (Contributor) commented:

Also want to share a similar change for expire-snapshots to use BulkDeletion: #5412

@dungdm93 changed the title from "Chores: using bulk delete AMAP" to "Chores: using bulk delete if it's possible" Aug 2, 2022
@dungdm93 (Contributor, Author) commented Aug 2, 2022

@szehon-ho I'm also thinking about the retry and service-pool mechanisms for bulkDelete.
There are two options:

  • The first option is to let the caller split the files to delete into chunks and pass each chunk to the bulkDelete API, the same way as a normal delete:
    Iterable<List<String>> deleteFileChunked = ....
    Tasks.foreach(deleteFileChunked)
        .retry(3)
        .executeWith(service)
        .run(io::deleteFiles);
    The drawbacks of this approach:
    * chunking is computed twice, once in the caller and once in the FileIO (as in the S3FileIO implementation)
    * if a request fails, there is no way to retry only the failed files
  • The second option is for the bulkDelete API to return a Task-like object and let the caller decide how it should run (number of retries, executor service, error handler, etc.). Something like this:
    io.deleteFiles(files) // returns a Task-like object
      .retry(3)
      .executeWith(service)
      .onFailure(...)
      .execute()  // now the task will be executed

The second option introduces a breaking change in the bulkDelete API. However, I still prefer it because the API isn't used anywhere yet.
WDYT?
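As a sketch of what that second option could look like, a hypothetical deferred-execution interface (none of these names exist in Iceberg; they only illustrate the idea):

    import java.util.concurrent.ExecutorService;
    import java.util.function.BiConsumer;

    // Hypothetical: deleteFiles would return this instead of running eagerly.
    interface BulkDeleteTask {
      BulkDeleteTask retry(int times);                 // retries per failed batch
      BulkDeleteTask executeWith(ExecutorService svc); // caller-supplied pool
      BulkDeleteTask onFailure(BiConsumer<String, Exception> handler);
      void execute();                                  // nothing runs until this call
    }

Under this shape, the FileIO still owns batching, while retries, pooling, and failure handling stay with the caller.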

@dungdm93 requested a review from szehon-ho August 2, 2022 16:00
@szehon-ho (Member) commented:

I think we need to coordinate with @amogh-jahagirdar, who is also working on this in #5373. There seem to be a few PRs trying to use the bulk delete; let's see if we can push the retry logic into the FileIO.

@dungdm93 (Contributor, Author) commented Aug 3, 2022

I don't think we should put retry logic into bulk delete, because different clients may want different configs. For example, BaseTransaction uses noRetry while HiveIcebergRecordWriter uses retry(3).
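For reference, a condensed sketch of the two retry styles being contrasted, assuming Iceberg's Tasks API (simplified, not copied from those classes):

    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.util.Tasks;

    class RetryStylesSketch {
      // BaseTransaction-style cleanup: best effort, no retries.
      static void cleanupNoRetry(FileIO io, List<String> files) {
        Tasks.foreach(files)
            .noRetry()
            .suppressFailureWhenFinished()
            .run(io::deleteFile);
      }

      // HiveIcebergRecordWriter-style abort: retry each delete up to 3 times.
      static void abortWithRetry(FileIO io, List<String> files) {
        Tasks.foreach(files)
            .retry(3)
            .run(io::deleteFile);
      }
    }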

@amogh-jahagirdar (Contributor) commented Aug 3, 2022

@dungdm93 Not sure about comparing transactions and record writers, but I think I understand what you're getting at. Another view on this (and I think this is what you are thinking; let me know if it's not) is that procedures should be in control of their retry strategy, failure handling, and so on, not the FileIO implementation, because procedures/actions may want different configurations at different times.

My only doubt is that building a public Util which wraps the existing Tasks framework mainly just for batch deletion seems heavy-handed at this point in time. My feeling is that we shouldn't add public Util classes or any kind of public APIs unless we are really sure they will get used, and that's why I put it in the FileIO implementation for simplicity; that seems more easily reversible in case folks want more configuration of retry behavior for different procedures. If we know we want this customizability up front, then I think this makes sense. Right now, looking at the implementation, the desired delete behavior is basically the same; the only difference between the integrations is whether we use concurrent execution or not (which I still think the FileIO layer should handle?).

Would like to get the community's thoughts! @dungdm93 @szehon-ho @danielcweeks @rdblue

@szehon-ho (Member) commented:

> I don't think we should put retry logic into bulk delete, because different clients may want different configs. For example, BaseTransaction uses noRetry while HiveIcebergRecordWriter uses retry(3).

> @dungdm93 Not sure about comparing transactions and record writers, but I think I understand what you're getting at. Another view on this (and I think this is what you are thinking; let me know if it's not) is that procedures should be in control of their retry strategy, failure handling, and so on, not the FileIO implementation, because procedures/actions may want different configurations at different times.

To me, if we have decided to push the batching logic down into S3FileIO.deleteFiles(), which seems to be the case, I'm not sure I see another way than for S3FileIO to handle the retry logic. The caller cannot retry the whole thing in bulk, because some batches may have succeeded and others failed, right? If that's the case, the only thing we can do is have the callers pass different retry options to S3FileIO, to preserve their original intent. For example, expireSnapshots has retry(3), and the ones here have no retry, like you mentioned. Or maybe I'm not being imaginative enough.
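One hypothetical shape for passing that intent through (no such options parameter exists in the FileIO API; it is invented here purely to illustrate the suggestion):

    // Hypothetical options object: the caller states its retry intent and the
    // FileIO applies it per batch, so only failed batches are retried.
    class BulkDeleteOptions {
      final int retries; // 0 would preserve the no-retry call sites in this PR

      BulkDeleteOptions(int retries) {
        this.retries = retries;
      }
    }

    interface BulkDeleteWithOptions {
      // Hypothetical overload: expireSnapshots would pass retries=3,
      // while the call sites in this PR would pass retries=0.
      void deleteFiles(Iterable<String> paths, BulkDeleteOptions options);
    }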

@github-actions (bot)

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions bot added the stale label Aug 17, 2024
@github-actions (bot)

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions bot closed this Aug 26, 2024