Core: Bulk deletion in RemoveSnapshots #11837

Conversation
(force-pushed 0c875f9 to aafd0fa)

Slack discussion about this: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1733215233582339
(force-pushed 87083d5 to d0638e5)

I'm going to suggest some tests of failure handling to see what happens there.
Hi @steveloughran,

Hi @amogh-jahagirdar,

@amogh-jahagirdar Would you mind taking a look? This came from a Slack discussion we had earlier. cc @pvary in case you have some capacity for this.
On:
     * safe.
     */
    public class BulkDeleteConsumer implements Consumer<String> {
      private final List<String> files = Lists.newArrayList();
I'm a bit afraid that this list could become quite big.
Could we "flush" the delete in batches?
could also make for a slow delete at the end. Ideally there'd be a page size for deletes, say 1000, and then kick off the delete in a separate thread.
Both S3A and S3FileIO have a configurable page size; s3a bulk delete is also rate limited per bucket.
+1 to flushing in batches, the list of files can be quite large for snapshot expiration. I think having a constant 1000 is fine to begin with.
I don't think it's strictly required to kick off the delete in a separate thread, and would prefer to keep it simple at least for now. We are generally performing bulk deletes in maintenance operations which are already long-running, and a good chunk of that time is spent in CPU/memory-bound computations of which files to delete rather than actually doing the deletion.
If it's a real issue I'd table that as an optimization for later.
I've just checked the CatalogUtil.deleteFiles() functionality, and apparently it does the batching by gathering the files to delete per manifest and doing a batch delete for each manifest separately, if I understand it right. Let me see whether, instead of implementing batching logic here, it is feasible to do the 'per manifest' batching.
Just for the record, I've checked S3FileIO and that seems to take care of batching the deletes itself. I'll have to investigate this more, but it seems to me that we'd be double batching without much gain, just extra code complexity.
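For illustration, a batched variant of such a consumer might look like the sketch below. This is an assumption-laden sketch, not the PR's code: the class name, the `Consumer<List<String>>` callback, and the explicit `flush()` contract are all invented for the example; only the batch-size constant of 1000 comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only (not the Iceberg API): a delete consumer that
// buffers paths and hands them to a bulk-delete callback in fixed-size
// batches, so the in-memory list never grows past BATCH_SIZE.
class BatchingDeleteConsumer implements Consumer<String> {
  static final int BATCH_SIZE = 1000; // the constant suggested in the review

  private final List<String> buffer = new ArrayList<>();
  private final Consumer<List<String>> bulkDelete;

  BatchingDeleteConsumer(Consumer<List<String>> bulkDelete) {
    this.bulkDelete = bulkDelete;
  }

  @Override
  public void accept(String path) {
    buffer.add(path);
    if (buffer.size() >= BATCH_SIZE) {
      flush();
    }
  }

  // Callers must invoke this once after the last path, otherwise the
  // final partial batch is never deleted.
  void flush() {
    if (!buffer.isEmpty()) {
      bulkDelete.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
  }
}
```

The final `flush()` call is the price of batching: the last partial page only goes out when the caller signals the end of the stream.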
On:
      return;
    }
    ...
    ops.deleteFiles(files);
Do we want to do retry, error handling?
Retry: no, they should do it themselves. If you add a layer of retry on top of their code, you simply double wrap the failures for exponential delays before giving up.
Do not try and be clever here. Look at the S3A connector policy, recognise how complicated it is, different policies for connectivity vs throttle vs other errors, what can be retried, how long to wait/backoff, etc etc.
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java
Double wrapping retries is a real PITA; it's bad enough that the V2 SDK has taken to retrying some things (UnknownHostException) that it never used to... doing the same in the app makes things worse.
regarding error handling: what is there to do other than report an error?
The original delete path had retries. See:
Tasks.foreach(pathsToDelete)
.executeWith(deleteExecutorService)
.retry(3)
.stopRetryOn(NotFoundException.class)
.stopOnFailure()
.suppressFailureWhenFinished()
.onFailure(
(file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
.run(deleteFunc::accept);
I think we should match the original behavior.
Yeah I think I agree with @pvary that to begin with we should probably mimic the existing delete retry behavior. In terms of error handling the deletion is all best effort. No operation should be impeded due to failure to physically delete a file off disk.
Though I understand @steveloughran's point that double wrapping retries is not good either, since we're essentially retrying 3 * num_sdk_retries times on every retryable failure, which just keeps clients busy for unnecessarily long.
I think there's a worthwhile discussion to be had in a follow-up on whether we want to tune these retry behaviors in their entirety to account for clients already performing retries.
I also don't know what the other clients such as Azure/GCS do in terms of automatic retries (since we want whatever is here to generalize across other systems).
Thanks for the observations! I know the sequential delete approach here does retries; however, I also checked CatalogUtil.deleteFiles() (which is used by the dropTable() functionality) and apparently there we don't do retries.
I checked DeleteOrphanFilesSparkAction too, and apparently that also doesn't do retries for either the bulk or the sequential deletes.
Also it doesn't do batching itself; it leaves that to the FileIO (relevant for the other question on this PR).
Additionally, if bulk deletion fails, it's possible that a subset of the files were deleted, but we won't know which ones: we just get a number of failures from the Exception. So in case of partial success/failure, if we retry the bulk delete with the same list of paths it will fail again and again.
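A toy simulation of that partial-failure problem (purely illustrative, not Iceberg code): the store below deletes every path it can but reports only how many deletes failed, not which ones, mimicking a bulk-delete error that carries just a failure count.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model: bulk delete removes every path it can and reports only a COUNT
// of failures, not which paths failed -- mirroring the exception described above.
class ToyBulkStore {
  final Set<String> existing = new HashSet<>();

  int deleteFiles(List<String> paths) {
    int failures = 0;
    for (String p : paths) {
      if (!existing.remove(p)) {
        failures++; // path was already gone (or never existed)
      }
    }
    return failures;
  }
}
```

After one call deletes the real files, a naive retry of the same path list reports every path as a failure, so a blind retry loop would never converge; this is why retrying an opaque partial failure with the same list is pointless.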
steveloughran left a comment:

commented; no actual code suggestions
Sorry for the late follow-up, I'm taking a look!
On:
     * Consumer class to collect file paths one by one and perform a bulk deletion on them. Not thread
     * safe.
     */
    public class BulkDeleteConsumer implements Consumer<String> {
Does this need to be public? It'd be ideal if this can be package private and encapsulated in the core places where it's needed.
The idea was that even users can have an implementation of this so that they can plug in consumers that are capable of using the bulk delete interface.
Giving this a second thought, and after checking the DeleteOrphanFilesSparkAction code, I think I overcomplicated things here. In DeleteOrphanFilesSparkAction the deleteFunc is also pluggable, but it keeps things simple:
- If there is a plugged deleteFunc (that is a Consumer), use it.
- If there is no plugged deleteFunc and the FileIO supports bulk delete, then do a bulk delete.
- Otherwise do a sequential delete.
See: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L253
The difference is that for expire snapshots the deleteFunc won't be null, but that's a minor thing I think.
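That dispatch order could be sketched as follows. The interfaces here are simplified stand-ins invented for the example (the real types are Iceberg's FileIO and SupportsBulkOperations), and the returned label is only there to make the chosen path observable.

```java
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-ins for the Iceberg interfaces; illustrative only.
interface FileIO {
  void deleteFile(String path);
}

interface BulkFileIO extends FileIO {
  void deleteFiles(Iterable<String> paths);
}

class DeleteDispatch {
  // Mirrors the order described above: 1) a user-plugged deleteFunc wins,
  // 2) bulk delete if the FileIO supports it, 3) otherwise sequential
  // single-file deletes.
  static String delete(Consumer<String> pluggedFunc, FileIO io, List<String> paths) {
    if (pluggedFunc != null) {
      paths.forEach(pluggedFunc);
      return "plugged";
    } else if (io instanceof BulkFileIO) {
      ((BulkFileIO) io).deleteFiles(paths);
      return "bulk";
    } else {
      paths.forEach(io::deleteFile);
      return "sequential";
    }
  }
}
```

For expire snapshots the plugged function is never null by default, so whether the default deleteFunc delegates to bulk deletion is exactly the "minor thing" noted above.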
Thanks for taking a look @pvary, @amogh-jahagirdar and @steveloughran! I'll be offline for a couple of days but will take a deeper look after that. Another thing we discussed with Peter apart from the comments is that the current PR abuses the concept of a ... I also responded to the current comments.
(force-pushed d0638e5 to f28fb78)

Hi @amogh-jahagirdar, @pvary, I managed to simplify the original PR and uploaded a new version. I took a deeper look at how orphan file deletion does the same thing and I basically follow the same approach here too. Additionally, there were comments about retries and batching the files. I believe these are not needed here, and this is also in line with orphan file cleanup. Let me know what you think!
Why did you decide against a util method? The code is really very similar to the orphan file removal stuff.

I figured that first we should agree on the approach, because last time we were talking about retries and batch deletes. Once the approach is fine, we could think about refactoring this and moving code to a common util. However, I checked the code and apparently there are multiple places where we apply very similar logic, so I thought that we might want to merge this PR without the refactor, and then as a next step I could take a look at moving all these occurrences into a common util class. WDYT?
On:
    @SuppressWarnings("checkstyle:VisibilityModifier")
    abstract class FileCleanupStrategy {
      private final Consumer<String> defaultDeleteFunc =
          new Consumer<String>() {
nit: String is not needed here
Done
I'm fine with the current approach.

On:
    @TestTemplate
    public void testRemoveFromTableWithBulkIO() {
      TestBulkLocalFileIO spyFileIO = Mockito.spy(new TestBulkLocalFileIO());
I don't get this approach.
If we create our own FileIO implementation, why do we use Mockito to spy?
Could we just use a TestFileIO which does the counting, and then we can forget about the Mockito in this test?
I think it's cleaner to use a Mockito spy on these functions and perform the verification steps on the spy. The custom FileIO was needed to have a test FileIO that derives from SupportsBulkOperations and has a deleteFiles() function. Sure, we could introduce a Set in our FileIO implementation that collects the paths received as parameters, plus a counter that counts how many times the deleteFiles() function was called, etc., and we could even separate the received paths by call, but that is unnecessary hand-written code since Mockito already gives this to us.
See L1700-1703: we verify that deleteFiles() was called 3 times, and we could verify the given paths broken down by each call of the function. With this we can see that even in a scenario where we get an exception during deletion, we still call the deleteFiles() function for all the other file types too.
I prefer the current implementation to writing custom verification code, but I might be missing something here.
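For comparison, the hand-rolled alternative being weighed here (a test double that records its own calls, with no Mockito) might look like this sketch; the class name and structure are illustrative, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled recording double: each deleteFiles() call is captured as its
// own list, so a test can assert both the call count and the per-call paths
// without a Mockito spy.
class RecordingBulkDeletes {
  final List<List<String>> calls = new ArrayList<>();

  void deleteFiles(Iterable<String> paths) {
    List<String> copy = new ArrayList<>();
    paths.forEach(copy::add);
    calls.add(copy);
  }
}
```

This is exactly the bookkeeping that Mockito's verify() and argument capturing provide for free, which is the argument made above for keeping the spy.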
Could we use something like this then:
SupportsBulkOperations bulkFileIO = Mockito.mock(SupportsBulkOperations.class, withSettings().extraInterfaces(FileIO.class));
Thanks for the suggestion, @pvary! I tried this out, but apparently it's not suitable for the tests to have a completely mocked FileIO; we need at least a TestFileIO instance to be able to perform operations on the test tables. In this case a very early step (tableWithBulkIO.newAppend().appendFile(FILE_A).commit();) fails if there is only a mock and no real FileIO instance under the table.
The current implementation uses the deleteFile() of the FileIO even if it supports bulk operations. Even though the user of the RemoveSnapshots API can provide a custom Consumer to perform bulk deletion, Iceberg can be clever enough itself to find out whether bulk deletion is possible on the FileIO.
(force-pushed f28fb78 to 944a32b)
LGTM +1 |
|
Merged to main. |