Chores: using bulk delete if it's possible #5375
Conversation
Signed-off-by: Đặng Minh Dũng <[email protected]>
szehon-ho left a comment:
I think this is a good direction. I'd also like @amogh-jahagirdar to take a look to see whether the FileIOUtil makes sense here or not.
```java
.noRetry()
.executeWith(service)
.suppressFailureWhenFinished()
.onFailure((file, exc) -> LOG.warn("Delete failed for {}: {}", name, file, exc))
```
Are we missing a parameter?
Never mind about this.
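(For context on why the question was withdrawn: SLF4J treats a trailing Throwable argument specially, so two `{}` placeholders plus three arguments is valid here. A minimal sketch, assuming a standard SLF4J setup; the class and values are illustrative:)

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class Slf4jTrailingThrowable {
  private static final Logger LOG = LoggerFactory.getLogger(Slf4jTrailingThrowable.class);

  public static void main(String[] args) {
    String name = "manifest";                     // fills the first {}
    String file = "s3://bucket/m1.avro";          // fills the second {} (hypothetical path)
    Exception exc = new RuntimeException("boom"); // trailing Throwable: stack trace is logged

    // Two placeholders, three arguments: SLF4J binds the trailing Throwable
    // to the exception slot, so no parameter is actually missing.
    LOG.warn("Delete failed for {}: {}", name, file, exc);
  }
}
```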
```java
if (io instanceof SupportsBulkOperations) {
  try {
    SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
    bulkIO.deleteFiles(files);
```
Should we use the configured 'service' pool?
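A sketch of the dispatch pattern under discussion, assuming the caller already holds the configured `service` pool and the `files` to delete (class and method names are illustrative, not the PR's exact code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.util.Tasks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BulkDeleteFallback {
  private static final Logger LOG = LoggerFactory.getLogger(BulkDeleteFallback.class);

  static void deleteFiles(FileIO io, List<String> files, ExecutorService service) {
    if (io instanceof SupportsBulkOperations) {
      // Bulk path: note it does not use 'service', which is the question above.
      ((SupportsBulkOperations) io).deleteFiles(files);
    } else {
      // Per-file fallback on the configured executor.
      Tasks.foreach(files)
          .executeWith(service)
          .noRetry()
          .suppressFailureWhenFinished()
          .onFailure((file, exc) -> LOG.warn("Delete failed for {}", file, exc))
          .run(io::deleteFile);
    }
  }
}
```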
```diff
  private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
-   Tasks.foreach(manifests)
+   FileIOUtil.bulkDeleteManifests(io, manifests)
```
Looks like we don't pass a name here. Thinking about it, if a name is always recommended, why not make it a mandatory parameter of the bulkDelete API?
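A hypothetical shape for that suggestion; `FileIOUtil` comes from this PR, but the signatures and ordering below are illustrative, not the PR's actual API:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.io.FileIO;

// Sketch: make the log-friendly name a required parameter so call sites
// cannot forget it.
class FileIOUtilSketch {
  static void bulkDelete(FileIO io, String name, Iterable<String> paths) {
    // ... bulk-delete `paths`, tagging failure logs with `name`
  }

  static void bulkDeleteManifests(FileIO io, Iterable<ManifestFile> manifests) {
    List<String> paths =
        StreamSupport.stream(manifests.spliterator(), false)
            .map(ManifestFile::path)
            .collect(Collectors.toList());
    // Specialized entry points can supply the name on the caller's behalf.
    bulkDelete(io, "manifest", paths);
  }
}
```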
Also want to share a similar change for expire-snapshots to use bulk deletion: #5412
@szehon-ho I'm also thinking about the retry and service pool mechanism for bulkDelete.

The second option would introduce a breaking change in the bulkDelete API. However, I still prefer it because the API isn't used anywhere yet.
I think we need to coordinate with @amogh-jahagirdar, who is also working on this in #5373. There seem to be a few PRs trying to use the bulk delete; let's see if we can push the retry logic into the FileIO?
I don't think we should put retry logic into bulk delete, because different clients may want to use different configs. For example...
@dungdm93 Not sure about comparing transactions and record writers, but I think I understand what you're getting at. Another view on this (and I think this is what you are thinking; let me know if it's not) is that procedures should be in control of their retry strategy, failure handling, and so on, not the FileIO implementation, because procedures/actions may want different configurations at different times. My only doubt is that building a public Util which wraps the existing Tasks framework mainly just for batch deletion seems heavy-handed at this point in time. My feeling is that we shouldn't add public Util classes or any public APIs unless we are really sure they will get used, which is why I put it in the FileIO implementation for simplicity; that seems more easily reversible in case folks want more configuration on retry behavior for different procedures. If we know that we want this customizability up front, then I think this makes sense. Right now, looking at the implementation, the desired delete behavior is basically the same; the only difference between the integrations is whether we use concurrent execution or not (which I still think the FileIO layer should handle?). Would like to get the community's thoughts! @dungdm93 @szehon-ho @danielcweeks @rdblue
To me, if we have decided to push the batching logic down into S3FileIO.deleteFiles(), which seems to be the case, I'm not sure I see another way than for S3FileIO to handle the retry logic. The caller cannot retry the whole thing in bulk, because some batches may have succeeded and others failed, right? If that's the case, the only thing we can do is have callers pass different retry options to S3FileIO, to preserve their original intent. For example, expireSnapshots has retry(3), and the ones here have no retry, like you mentioned. Or maybe I'm not being imaginative enough.
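To make the per-caller difference concrete, a sketch of the two intents mentioned above, assuming `io`, `filesToDelete`, and `LOG` from the surrounding code. If batching moves entirely into S3FileIO.deleteFiles(), this per-caller choice would have to travel into the FileIO as options instead:

```java
// Expire-snapshots style: retry each delete up to three times.
Tasks.foreach(filesToDelete)
    .retry(3)
    .suppressFailureWhenFinished()
    .onFailure((file, exc) -> LOG.warn("Delete failed for {}", file, exc))
    .run(io::deleteFile);

// The call sites in this PR: deliberately no retry.
Tasks.foreach(filesToDelete)
    .noRetry()
    .suppressFailureWhenFinished()
    .onFailure((file, exc) -> LOG.warn("Delete failed for {}", file, exc))
    .run(io::deleteFile);
```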
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Using batch delete (a.k.a. bulk delete) to delete multiple files if the FileIO implements SupportsBulkOperations (#4012).
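For reference, a sketch of the interface shape targeted by #4012; this is a reconstruction, and the exact signature in the Iceberg codebase may differ:

```java
// Reconstruction of the SupportsBulkOperations mixin discussed in #4012;
// the exact signature in Iceberg may differ.
public interface SupportsBulkOperations {
  // Delete every path in one logical operation. Implementations such as
  // S3FileIO can batch requests (S3 DeleteObjects accepts up to 1000 keys
  // per call), which is what makes this cheaper than per-file deletes.
  void deleteFiles(Iterable<String> pathsToDelete);
}
```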