
Conversation

@szehon-ho (Member) commented May 10, 2022

A variation of #3457 that attempts to use time travel on the manifests metadata table to do the snapshot filtering. This is a demonstration only, as it seems the manifests table does not support time travel yet.

@aokolnychyi (Contributor) commented May 18, 2022

I think this is an interesting idea. Let me summarize how I understand it.

Right now, we compute a diff between the reachability sets before and after snapshot expiry. When we build the reachability set before the expiry, we read the manifests of all snapshots, which seems suboptimal. The assumption is that it is sufficient to build the reachability set for just the expired snapshots and compare that to the reachability set after the snapshot expiry. Did I get that right?

I guess one way to implement this (and I believe this is what the PR tries to do) is to load the FILES metadata table for each expired snapshot and union the DataFrames to produce the reachability set. One potential problem with that is that we will load the manifest list for every expired snapshot on the driver, which can become a bottleneck if we expire a lot of snapshots. I've seen such cases.
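A rough sketch of that union approach in Spark (Scala), assuming `expiredSnapshotIds` is an in-memory collection of the expired snapshot IDs and that the `files` metadata table honors the `snapshot-id` read option (which, per the PR description, is not supported everywhere yet):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Hypothetical: the IDs of the snapshots being expired, collected on the driver.
val expiredSnapshotIds: Seq[Long] = Seq(1L, 2L, 3L)

// Read the FILES metadata table as of each expired snapshot and union the results
// into a single reachability set. Note this opens one manifest list per expired
// snapshot on the driver, which is the bottleneck described above.
val expiredFiles: DataFrame = expiredSnapshotIds
  .map { id =>
    spark.read
      .format("iceberg")
      .option("snapshot-id", id) // time travel to the expired snapshot
      .load("db.table.files")
      .select("file_path")
  }
  .reduce(_ union _)
  .distinct()
```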

An alternative idea is to add some sort of snapshot-ids option to the ALL_MANIFESTS metadata table and read only the manifest lists for snapshots whose IDs are in that list. That way, we won't read manifest lists and manifests added after the last expired snapshot, which is already an optimization. However, we can go even further and remove still-live manifests before opening them, which could give an even bigger performance boost. Suppose an expired snapshot references manifest-1 but we know it is still live; in that case, we can skip opening it entirely.
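A sketch of what that could look like in Spark (Scala). The `snapshot-ids` option here is the proposed addition, not an existing Iceberg read option, and `expiredSnapshotIds` is the assumed collection from the earlier sketch:

```scala
// Proposed: restrict ALL_MANIFESTS to the manifest lists of the expired snapshots.
val expiredManifests = spark.read
  .format("iceberg")
  .option("snapshot-ids", expiredSnapshotIds.mkString(",")) // hypothetical new option
  .load("db.table.all_manifests")
  .select("path")

// Manifests still reachable from the retained (live) snapshots.
val liveManifests = spark.read
  .format("iceberg")
  .load("db.table.all_manifests")
  .select("path")

// Second optimization: never open a manifest that is still live.
val candidateManifests = expiredManifests.except(liveManifests)
```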

Does that seem reasonable?

@szehon-ho @RussellSpitzer @flyrain @rdblue @kbendick

@szehon-ho (Member, Author) commented May 18, 2022

Thanks for this complete write-up; it makes sense to me. Yeah, I should have thought of passing a list of snapshot IDs as an option; that would probably be a cleaner way to achieve this filtering to get expiredManifests.

And I like the second optimization as well (expiredManifests.except(allManifests)).

@aokolnychyi (Contributor)

@RussellSpitzer, could you cross-check this idea, since you worked on this part in the past?

@rdblue (Contributor) commented May 18, 2022

I think this is a really good idea, and I also like Anton's idea to filter manifests that are still live from the removed snapshots as well. Basically, we should be filtering the tree at every level (snapshot/manifest-list, manifest, files) before moving on to the next one.

  1. Open the old metadata file and find snapshots that are no longer in the current metadata
  2. Create a DF of the manifests table from all the expired snapshots
  3. Create a DF of the manifests table from all the current snapshots
  4. Remove any manifests from the expired set that are currently in the table (all files are still referenced)
  5. Remove any manifests from the current set that have no EXISTING files (cannot contain old files). Optional because files removed and re-added would not be caught.
  6. Transform the expired manifests DF to expired data files by reading each manifest
  7. Transform the current manifests DF to current data files by reading each manifest
  8. Remove any files from the expired data file set that are in the current data file set

Does that sound right? We may not want to do #6, but there is an opportunity to cut down on the number of manifests we read there by looking at the manifest file metadata.
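For concreteness, a condensed sketch of that pipeline (Spark Scala). Here `expiredSnapshotIds`, `currentSnapshotIds`, and the `snapshot-ids` read option are the assumptions carried over from the earlier sketches, and `readDataFiles` is a placeholder for the distributed manifest-reading step in 6 and 7:

```scala
import org.apache.spark.sql.DataFrame

// Steps 2-3: manifests reachable from the expired vs. the current snapshots.
def manifestsOf(snapshotIds: Seq[Long]): DataFrame =
  spark.read
    .format("iceberg")
    .option("snapshot-ids", snapshotIds.mkString(",")) // hypothetical option
    .load("db.table.all_manifests")
    .select("path")

val expiredManifests = manifestsOf(expiredSnapshotIds)
val currentManifests = manifestsOf(currentSnapshotIds)

// Step 4: a manifest still referenced by a live snapshot cannot yield deletable files.
val candidateManifests = expiredManifests.except(currentManifests)

// Steps 6-7: placeholder for a distributed expansion of manifest paths into the
// data files they reference (e.g. a mapPartitions that opens each manifest).
def readDataFiles(manifests: DataFrame): DataFrame = ???

// Step 8: only files unreachable from every current manifest may be deleted.
val candidateFiles = readDataFiles(candidateManifests)
  .except(readDataFiles(currentManifests))
```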

@aokolnychyi (Contributor)

@rdblue, yeah, that's what I am thinking too. Do you refer to point 5 or 6 as optional? I guess 5. The current code reads only live entries, so filtering out manifests without live files seems safe. What's the use case with re-added files? Won't that be covered too, since the snapshot with the re-added files will also be considered?

@rdblue (Contributor) commented May 18, 2022

Do you refer to point 5 or 6 as optional?

You're right: it's 5 that is optional. My mistake.

@RussellSpitzer (Member)

The manifest-skipping idea sounds good to me; it basically lets us cut down on the "except" cost as well.

@kbendick (Contributor)

Thanks for tagging me, Anton.

I think this is a good idea as well.

I won't repeat what others have said, but I have some questions and possibly a few additional ideas; I need to go through some theoretical cases on paper first.

One potential problem with that is that we will load the manifest list for every expired snapshot on the driver, which can become a bottleneck if we expire a lot of snapshots. I've seen such cases.

Two thoughts:

  1. We should add an event/metric describing this replanning work. It could be used as a signal to perform table maintenance.
  2. We might be able to track a metric to determine whether we should do this initial replanning work in a distributed manner.

@szehon-ho (Member, Author) commented Jul 6, 2022

Hi @aokolnychyi @rdblue, an update on this discussion (see the main changes in #3457). I implemented and played around with both optimizations for calculating the delete candidate files (skipping current snapshot IDs, and skipping current manifests). But I find that only this optimization makes a positive difference:

all_manifests table, reference_snapshot_id in set(expired snapshot ids)

The result is the same as what I saw initially: it reduces the Spark job time by 30-40% in the case of expiring 1 snapshot out of many.

I found that in my scenario (expiring 1 snapshot), skipping the valid manifests actually makes performance worse. As each manifest in the all_manifests table is now marked with a snapshot ID, the first filter already filters out any currently reachable manifests. So all the second optimization does is add one more Spark read of the all_manifests table and a Spark join, decreasing performance in this scenario.

EDIT: I was thinking there is one scenario where skipping the current manifests would help: the snapshots to expire point to a lot of data files that won't be deleted, because the manifests they point to are still live in the retained set of snapshots. But I still feel the cost of the extra Spark join may not be worth it in all cases. In the skip-valid-snapshots case, we get a free in-memory filter pushdown without launching a Spark job, whereas with the way the current code reads all the manifests, it has to be an additional Spark join.
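For reference, the one optimization that paid off is essentially a predicate on the all_manifests table's reference_snapshot_id column. A minimal sketch (Spark Scala, with `expiredSnapshotIds` as assumed above):

```scala
import org.apache.spark.sql.functions.col

// Keep only rows of all_manifests whose reference_snapshot_id belongs to an
// expired snapshot. This predicate can be pushed into the metadata-table scan
// planning, so it costs no extra Spark job or join.
val expiredManifests = spark.read
  .format("iceberg")
  .load("db.table.all_manifests")
  .filter(col("reference_snapshot_id").isin(expiredSnapshotIds: _*))
```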

@github-actions bot commented Aug 9, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions bot added the stale label Aug 9, 2024
@github-actions bot

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Aug 17, 2024
