Description
Performing file deletion in ExpireSnapshots with Branching and Tagging
After #4578 updated the expire snapshots procedure to respect retention policies for branching and tagging, one significant limitation remains: incremental file deletion cannot be performed as part of the procedure. The procedure fails if cleaning expired files is enabled and the table has multiple branches and tags.
Incremental file deletion cannot safely be performed when there are multiple branches, because branching itself gives no visibility into which files can be removed; a reference set of "reachable" files has to be built from the metadata tree.
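To make the reachable-set idea concrete, here is a minimal single-node sketch in plain Java. It does not use the Iceberg API: the class name, the method names, and the `filesBySnapshot` map (snapshot id to the file paths that snapshot references) are all hypothetical stand-ins for whatever the metadata tree walk would produce. It only illustrates why a file referenced by an expired snapshot may still need to be kept.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not Iceberg classes: each snapshot id maps to the set of
// file paths that are referenced from that snapshot's metadata.
class ReachableFilesSketch {

    // Union of all files referenced by the retained snapshots
    // (branch heads, retained ancestors, and tagged snapshots).
    static Set<String> reachableFiles(Map<Long, Set<String>> filesBySnapshot,
                                      Set<Long> retainedSnapshotIds) {
        Set<String> reachable = new HashSet<>();
        for (Long id : retainedSnapshotIds) {
            reachable.addAll(filesBySnapshot.getOrDefault(id, Set.of()));
        }
        return reachable;
    }

    // Files referenced by expired snapshots that no retained snapshot references.
    static Set<String> safeToDelete(Map<Long, Set<String>> filesBySnapshot,
                                    Set<Long> expiredSnapshotIds,
                                    Set<Long> retainedSnapshotIds) {
        Set<String> reachable = reachableFiles(filesBySnapshot, retainedSnapshotIds);
        Set<String> candidates = new HashSet<>();
        for (Long id : expiredSnapshotIds) {
            candidates.addAll(filesBySnapshot.getOrDefault(id, Set.of()));
        }
        candidates.removeAll(reachable);
        return candidates;
    }

    public static void main(String[] args) {
        // Snapshot 1 is expired on main, snapshot 2 is held by a tag,
        // snapshot 3 is the current head of main.
        Map<Long, Set<String>> filesBySnapshot = Map.of(
            1L, Set.of("a.parquet", "b.parquet"),
            2L, Set.of("b.parquet", "c.parquet"),
            3L, Set.of("c.parquet", "d.parquet"));

        Set<String> deletable = safeToDelete(filesBySnapshot, Set.of(1L), Set.of(2L, 3L));
        System.out.println(deletable);
    }
}
```

In this example only `a.parquet` is deletable: `b.parquet` is referenced by the expired snapshot too, but the tagged snapshot still needs it, which is exactly the case that reasoning about a single branch would miss.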
This issue has come up in previous community syncs, and I wanted to discuss the approach:
1.) Update the remove snapshots API implementation (the part where the procedure determines which files are safe to delete) to build an in-memory reference set of reachable files across the retained branch snapshots and tags. This poses a problem for large tables, where the list of files would be too large to hold in memory on a single node, which brings us to point 2.
2.) For users with very large tables, as discussed in a previous community sync, it can reasonably be assumed that they have Spark infrastructure for running an effective distributed procedure. Currently, the Spark procedure performs the metadata removal for expired snapshots (https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L185), and the Spark action itself is responsible for the anti-join of the reachable files before and after the expiration and for the subsequent deletion.
The Spark procedure would also be updated to provide a better distributed file deletion procedure in the context of branching and tagging. We could refer (conceptually) to what Nessie does for its garbage collection implementation: https://github.com/projectnessie/nessie/blob/main/gc/gc-base/src/main/java/org/projectnessie/gc/base/GCImpl.java#L58
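The anti-join mentioned in point 2 can be sketched on a single node as follows. This is only an illustration of the set logic, written with plain Java collections rather than Spark datasets; the class and method names are hypothetical, and a distributed version would express the same filter as a left anti-join between the "before" and "after" file listings.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Iceberg or Spark.
class ExpireAntiJoinSketch {

    // Left anti-join: keep every path that was reachable before expiration
    // and has no match among the paths still reachable after expiration.
    static List<String> antiJoin(List<String> reachableBefore, Set<String> reachableAfter) {
        return reachableBefore.stream()
            .filter(path -> !reachableAfter.contains(path))
            .distinct()   // a file may be listed once per snapshot that references it
            .sorted()     // deterministic order for the deletion step
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> before = List.of("a.parquet", "b.parquet", "c.parquet", "b.parquet");
        Set<String> after = Set.of("b.parquet");
        // Paths left over are the candidates for deletion.
        System.out.println(antiJoin(before, after));
    }
}
```

With branches and tags, the "after" side has to include files reachable from every retained reference, not only from the main branch history.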
If there is consensus in the community on this plan, I'll start the implementation for 1, so that at least the limitation in deletion is removed. For 2, we may want a separate doc for the distributed algorithm proposal?
CC: @rdblue @jackye1995 @namrathamyske @aokolnychyi @RussellSpitzer
Query engine
No response