Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 13 additions & 4 deletions core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.relocated.com.google.common.base.Joiner;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;
Expand Down Expand Up @@ -366,11 +367,19 @@ private void removeExpiredFiles(
// Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
// as much of the delete work as possible and avoid orphaned data or manifest files.

// this is the set of ancestors of the current table state. when removing snapshots, this must
// only remove files that were deleted in an ancestor of the current table state to avoid
// ToDo: This will be removed when reachability analysis is done so files across multiple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It should be TODO.

// branches can be removed
SnapshotRef branchToCleanup = Iterables.getFirst(base.refs().values(), null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the expectation is that only one ref exists. Either main or branch ref. What if it's a tag ref?

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is the following:

1.) Logically, a tagged snapshot would either need to exist on either a.) non-main branch b.) main-branch
2.) If the tag exists on main a file cleanup couldn't (currently) be done in the first place (because main cannot age off so we'd have multiple refs), so this point wouldn't have been reached
3.) If the tag exists on a non-main branch and the non-main branch ages off before the tagged snapshot which gets retained, then the tag ends up being de-facto "tip" of a lineage. In which case, the expiration logic would work as expected. If non-main branch still is retained, then we wouldn't reach this point (same case as 2, just that the other ref is the non-main branch).

Combining this with the fact that writes cannot be performed on tags leads me to believe that for purpose of expiration , specifically determining which files to delete, there's no need to differentiate tags and branches.

I could call this refToCleanup if that makes more sense to folks? But the only case where this is a tag is the case what I mentioned in 3.) in which case it's just a "dangling" snapshot which is referenced by a tag. @namrathamyske @rdblue @jackye1995

Also let me know if there's a flaw in my logic

if (branchToCleanup == null) {
return;
}

Snapshot branchTip = base.snapshot(branchToCleanup.snapshotId());

// this is the set of ancestors of the branch to cleanup. when removing snapshots, this must
// only remove files that were deleted in an ancestor of the branch to cleanup to avoid
// physically deleting files that were logically deleted in a commit that was rolled back.
Set<Long> ancestorIds =
Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(branchTip, base::snapshot));

Set<Long> pickedAncestorSnapshotIds = Sets.newHashSet();
for (long snapshotId : ancestorIds) {
Expand Down