Performing file deletion in ExpireSnapshots with Branching and Tagging #5653

@amogh-jahagirdar


After the change in #4578, which updated the expire snapshots procedure to respect retention policies for branching and tagging, one significant limitation remains: the procedure cannot perform incremental file deletion. It fails if cleaning of expired files is enabled and there are multiple branches and tags.

Incremental file deletion cannot be performed safely when there are multiple branches, because a single branch's history does not show which files can be removed: a file that one branch no longer references may still be referenced by another branch or tag. Instead, a reference set of "reachable" files has to be built from the metadata tree.
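To make the reachability argument concrete, here is a minimal sketch in Python. It is an illustration only: `Manifest`, `Snapshot`, `reachable_files`, and `files_safe_to_delete` are hypothetical names for a simplified model of the metadata tree, not Iceberg API classes, and the file names are made up.

```python
from dataclasses import dataclass


# Hypothetical, simplified model of the metadata tree (not Iceberg classes):
# a snapshot points at manifests, and each manifest lists data files.
@dataclass(frozen=True)
class Manifest:
    path: str
    data_files: frozenset


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple


def reachable_files(snapshots):
    """Collect every manifest and data file path reachable from the snapshots."""
    reachable = set()
    for snapshot in snapshots:
        for manifest in snapshot.manifests:
            reachable.add(manifest.path)
            reachable.update(manifest.data_files)
    return reachable


def files_safe_to_delete(expired, retained):
    """Files referenced by expired snapshots that no retained snapshot still needs."""
    return reachable_files(expired) - reachable_files(retained)


# s1 is expired on main, s2 is retained on main, s3 is retained by another ref.
s1 = Snapshot(1, (Manifest("m1.avro", frozenset({"a.parquet", "b.parquet"})),))
s2 = Snapshot(2, (Manifest("m2.avro", frozenset({"b.parquet", "c.parquet"})),))
s3 = Snapshot(3, (Manifest("m3.avro", frozenset({"a.parquet"})),))

# Looking only at main would wrongly mark a.parquet as deletable.
main_only = files_safe_to_delete([s1], [s2])
# Taking every retained ref into account keeps a.parquet alive.
all_refs = files_safe_to_delete([s1], [s2, s3])
```

In this toy example, `main_only` contains `a.parquet` while `all_refs` does not, which is the case that makes per-branch incremental deletion unsafe.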

This issue has come up in previous community syncs, and I wanted to discuss the following approach:

1.) Update the remove snapshots API implementation (the part where the procedure determines which files are safe to delete) to build an in-memory reference set of the files reachable from the retained branch snapshots and tags. This poses a problem for large tables, where the list of files would be too large to hold in memory on a single node, which brings us to point 2.

2.) For users with really large tables, as discussed in a previous community sync, it is reasonable to assume that they have Spark infrastructure for running an effective distributed procedure. Currently the Spark procedure performs the metadata removal for removing snapshots https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L185, and the Spark action itself is responsible for the anti-join of the reachable files before and after the expiration, and for the subsequent deletion.

The Spark procedure would also be updated to perform file deletion in a better distributed way in the context of branching and tagging. Conceptually, we could refer to what Nessie does for its garbage collection implementation: https://github.com/projectnessie/nessie/blob/main/gc/gc-base/src/main/java/org/projectnessie/gc/base/GCImpl.java#L58
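The before/after anti-join mentioned in point 2 can be sketched on a single node as follows. This is a plain-Python stand-in rather than Spark code: `reachable`, `left_anti_join`, and the snapshot-to-files mapping are hypothetical, and in Spark the same step would be a left anti join between two datasets of file paths.

```python
def reachable(snapshot_files, snapshot_ids):
    """Union of the files referenced by the given snapshot ids."""
    files = set()
    for snapshot_id in snapshot_ids:
        files.update(snapshot_files[snapshot_id])
    return files


def left_anti_join(left, right):
    """Rows of `left` that have no match in `right`."""
    right_keys = set(right)
    return [row for row in left if row not in right_keys]


# Hypothetical table state: snapshot id -> files it references.
snapshot_files = {
    1: {"a.parquet", "b.parquet"},
    2: {"b.parquet", "c.parquet"},
    3: {"a.parquet", "d.parquet"},
    4: {"e.parquet"},
}

ids_before = [1, 2, 3, 4]  # every snapshot before expiration
ids_after = [2, 3]         # snapshots still retained by branches and tags

before = sorted(reachable(snapshot_files, ids_before))
after = sorted(reachable(snapshot_files, ids_after))

# Files reachable before the expiration but not after it are candidates for deletion.
to_delete = left_anti_join(before, after)
```

Because the comparison is made against everything reachable after the expiration, files shared between expired and retained snapshots (such as `a.parquet` and `b.parquet` here) are never selected for deletion.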

If there is consensus in the community on this plan, I'll start implementing 1, so that at least the limitation on deletion is removed. For 2, would we want a separate doc for the distributed algorithm proposal?

CC: @rdblue @jackye1995 @namrathamyske @aokolnychyi @RussellSpitzer
