
Conversation

@dungdm93 (Contributor) commented Jul 28, 2022

Use batch delete (a.k.a. bulk delete) to delete multiple files when the FileIO implements SupportsBulkOperations.
#4012

Signed-off-by: Đặng Minh Dũng <[email protected]>
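For context, a minimal sketch of the dispatch this PR proposes, using Iceberg's FileIO, SupportsBulkOperations, and Tasks APIs; the wrapper class and its method are illustrative, not the PR's actual FileIOUtil code:

    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.SupportsBulkOperations;
    import org.apache.iceberg.util.Tasks;

    class BulkDeleteSketch {
      // Hypothetical helper: bulk-delete when the FileIO supports it,
      // otherwise fall back to per-file deletes via the Tasks framework.
      static void deleteFiles(FileIO io, List<String> files) {
        if (io instanceof SupportsBulkOperations) {
          // One call; the implementation (e.g. S3FileIO) batches internally.
          ((SupportsBulkOperations) io).deleteFiles(files);
        } else {
          Tasks.foreach(files)
              .noRetry()
              .suppressFailureWhenFinished()
              .run(io::deleteFile);
        }
      }
    }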
@szehon-ho (Member) left a comment

I think this is a good direction. I'd also like @amogh-jahagirdar to take a look to see whether the FileIOUtil makes sense here or not.

    .noRetry()
    .executeWith(service)
    .suppressFailureWhenFinished()
    .onFailure((file, exc) -> LOG.warn("Delete failed for {}: {}", name, file, exc))
@szehon-ho: Are we missing a parameter?

@szehon-ho: Never mind about this.

    if (io instanceof SupportsBulkOperations) {
      try {
        SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
        bulkIO.deleteFiles(files);
@szehon-ho: Should we use the configured 'service' pool?


    private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
    -   Tasks.foreach(manifests)
    +   FileIOUtil.bulkDeleteManifests(io, manifests)
@szehon-ho: Looks like we don't pass a name here. Thinking about it, if a name is always recommended, why not make it a mandatory parameter of the bulkDelete API?

@dramaticlly (Contributor) commented:

Also want to share a similar change for expire-snapshots to use BulkDeletion: #5412

@dungdm93 changed the title from "Chores: using bulk delete AMAP" to "Chores: using bulk delete if it's possible" Aug 2, 2022
@dungdm93 (Contributor, Author) commented Aug 2, 2022

@szehon-ho I'm also thinking about the retry and service-pool mechanisms for bulkDelete.
There are two options:

  • The first option is to let the caller split the files to delete into chunks and pass each chunk to the bulkDelete API, the same way as a normal delete:
    Iterable<List<String>> deleteFileChunked = ....
    Tasks.foreach(deleteFileChunked)
        .retry(3)
        .executeWith(service)
        .run(io::deleteFiles);
    The drawbacks of this approach:
    * chunking is computed twice, once in the caller and once in the FileIO (as in the S3FileIO implementation)
    * if a request fails, there is no way to retry only the failed files
  • The second option is for the bulkDelete API to return a Task-like object and let the caller decide how it should run (number of retries, executor service, error handler, etc.). Something like this:
    io.deleteFiles(files) // returns a Task-like object
      .retry(3)
      .executeWith(service)
      .onFailure(...)
      .execute()  // now the task will be executed

The second option introduces a breaking change in the bulkDelete API. However, I still prefer it because the API isn't used anywhere yet.
WDYT?
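As a sketch of what that second option could look like, a hypothetical deferred-execution interface (none of these names exist in Iceberg; they only illustrate the idea):

    import java.util.concurrent.ExecutorService;
    import java.util.function.BiConsumer;

    // Hypothetical: deleteFiles would return this instead of running eagerly.
    interface BulkDeleteTask {
      BulkDeleteTask retry(int times);                 // retries per failed batch
      BulkDeleteTask executeWith(ExecutorService svc); // caller-supplied pool
      BulkDeleteTask onFailure(BiConsumer<String, Exception> handler);
      void execute();                                  // nothing runs until this call
    }

Under this shape, the FileIO still owns batching, while retries, pooling, and failure handling stay with the caller.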

@dungdm93 requested a review from szehon-ho August 2, 2022 16:00
@szehon-ho (Member) commented:

I think we need to coordinate with @amogh-jahagirdar, who is also working on this in #5373. There seem to be a few PRs trying to use the bulk delete; let's see if we can push the retry logic into the FileIO.

@dungdm93 (Contributor, Author) commented Aug 3, 2022

I don't think we should put retry logic into bulk delete, because different clients may want different configs. For example, BaseTransaction uses noRetry while HiveIcebergRecordWriter uses retry(3).
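For reference, a condensed sketch of the two retry styles being contrasted, assuming Iceberg's Tasks API (simplified, not copied from those classes):

    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.util.Tasks;

    class RetryStylesSketch {
      // BaseTransaction-style cleanup: best effort, no retries.
      static void cleanupNoRetry(FileIO io, List<String> files) {
        Tasks.foreach(files)
            .noRetry()
            .suppressFailureWhenFinished()
            .run(io::deleteFile);
      }

      // HiveIcebergRecordWriter-style abort: retry each delete up to 3 times.
      static void abortWithRetry(FileIO io, List<String> files) {
        Tasks.foreach(files)
            .retry(3)
            .run(io::deleteFile);
      }
    }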

@amogh-jahagirdar (Contributor) commented Aug 3, 2022

@dungdm93 Not sure about comparing transactions and record writers, but I think I understand what you're getting at. Another view on this (and I think this is what you are thinking; let me know if it's not) is that procedures should be in control of their retry strategy, failure handling, and so on, not the FileIO implementation, because procedures/actions may want different configurations at different times.

My only doubt is that building a public Util which wraps the existing Tasks framework mainly just for batch deletion seems heavy-handed at this point in time. My feeling is that we shouldn't add public Util classes or any kind of public APIs unless we are really sure they will get used, and that's why I put it in the FileIO implementation for simplicity; that seems more easily reversible in case folks want more configuration of retry behavior for different procedures. If we know we want this customizability up front, then I think this makes sense. Right now, looking at the implementation, the desired delete behavior is basically the same; the only difference between the integrations is whether we use concurrent execution or not (which I still think the FileIO layer should handle?).

Would like to get the community's thoughts! @dungdm93 @szehon-ho @danielcweeks @rdblue

@szehon-ho (Member) commented:

> I don't think we should put retry logic into bulk delete, because different clients may want different configs. For example, BaseTransaction uses noRetry while HiveIcebergRecordWriter uses retry(3).

> @dungdm93 Not sure about comparing transactions and record writers, but I think I understand what you're getting at. Another view on this (and I think this is what you are thinking; let me know if it's not) is that procedures should be in control of their retry strategy, failure handling, and so on, not the FileIO implementation, because procedures/actions may want different configurations at different times.

To me, if we have decided to push the batching logic down into S3FileIO.deleteFiles(), which seems to be the case, I'm not sure I see another way than for S3FileIO to handle the retry logic. The caller cannot retry the whole thing in bulk, because some batches may have succeeded and others failed, right? If that's the case, the only thing we can do is have the callers pass different retry options to S3FileIO, to preserve their original intent. For example, expireSnapshots has retry(3), and the ones here have no retry, like you mentioned. Or maybe I'm not being imaginative enough.
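One hypothetical shape for passing that intent through (no such options parameter exists in the FileIO API; it is invented here purely to illustrate the suggestion):

    // Hypothetical options object: the caller states its retry intent and the
    // FileIO applies it per batch, so only failed batches are retried.
    class BulkDeleteOptions {
      final int retries; // 0 would preserve the no-retry call sites in this PR

      BulkDeleteOptions(int retries) {
        this.retries = retries;
      }
    }

    interface BulkDeleteWithOptions {
      // Hypothetical overload: expireSnapshots would pass retries=3,
      // while the call sites in this PR would pass retries=0.
      void deleteFiles(Iterable<String> paths, BulkDeleteOptions options);
    }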

@github-actions (bot)

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions bot added the stale label Aug 17, 2024
@github-actions (bot)

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions bot closed this Aug 26, 2024