
Conversation

@szehon-ho (Member) commented May 10, 2022

A variation of #3457 that attempts to use time travel on the manifests metadata table to do the snapshot filtering. This is a demonstration only, as it seems the manifests table does not support time travel yet.

@aokolnychyi (Contributor) commented May 18, 2022

I think this is an interesting idea. Let me summarize how I understand it.

Right now, we compute a diff between the reachability sets before and after snapshot expiry. When we build the reachability set before the expiry, we read the manifests of all snapshots, which seems suboptimal. The assumption is that it is sufficient to build the reachability set for just the expired snapshots and compare that to the reachability set after the snapshot expiry. Did I get that right?

I guess one way to implement this (and I believe this is what the PR tries to do) is to load the FILES metadata table for each expired snapshot and union the DataFrames to produce the reachability set. One potential problem with that is that we will load the manifest list for every expired snapshot on the driver, which can become a bottleneck if we expire a lot of snapshots. I've seen such cases.
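A rough sketch of that union approach in Spark (Scala), assuming `expiredSnapshotIds` is an in-memory collection of the expired snapshot IDs and that the `files` metadata table honors the `snapshot-id` read option (which, per the PR description, is not supported everywhere yet):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Hypothetical: the IDs of the snapshots being expired, collected on the driver.
val expiredSnapshotIds: Seq[Long] = Seq(1L, 2L, 3L)

// Read the FILES metadata table as of each expired snapshot and union the results
// into a single reachability set. Note this opens one manifest list per expired
// snapshot on the driver, which is the bottleneck described above.
val expiredFiles: DataFrame = expiredSnapshotIds
  .map { id =>
    spark.read
      .format("iceberg")
      .option("snapshot-id", id) // time travel to the expired snapshot
      .load("db.table.files")
      .select("file_path")
  }
  .reduce(_ union _)
  .distinct()
```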

An alternative idea is to add some sort of snapshot-ids option to the ALL_MANIFESTS metadata table and read only the manifest lists for snapshots whose IDs are in that list. That way, we won't read manifest lists and manifests added after the last expired snapshot, which is already an optimization. However, we can go even further and remove still-live manifests before opening them, which could give an even bigger performance boost. Suppose an expired snapshot references manifest-1 but we know it is still live; in that case, we can skip opening it entirely.
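A sketch of what that could look like in Spark (Scala). The `snapshot-ids` option here is the proposed addition, not an existing Iceberg read option, and `expiredSnapshotIds` is the assumed collection from the earlier sketch:

```scala
// Proposed: restrict ALL_MANIFESTS to the manifest lists of the expired snapshots.
val expiredManifests = spark.read
  .format("iceberg")
  .option("snapshot-ids", expiredSnapshotIds.mkString(",")) // hypothetical new option
  .load("db.table.all_manifests")
  .select("path")

// Manifests still reachable from the retained (live) snapshots.
val liveManifests = spark.read
  .format("iceberg")
  .load("db.table.all_manifests")
  .select("path")

// Second optimization: never open a manifest that is still live.
val candidateManifests = expiredManifests.except(liveManifests)
```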

Does that seem reasonable?

@szehon-ho @RussellSpitzer @flyrain @rdblue @kbendick

@szehon-ho (Member, Author) commented May 18, 2022

Thanks for this complete write-up; it makes sense to me. Yeah, I should have thought of passing a list of snapshot IDs as an option; that would probably be a cleaner way to achieve this filtering to get expiredManifests.

And I like the second optimization as well (expiredManifests.except(allManifests)).

@aokolnychyi (Contributor)

@RussellSpitzer, could you cross-check this idea, since you worked on this part in the past?

@rdblue (Contributor) commented May 18, 2022

I think this is a really good idea, and I also like Anton's idea to filter manifests that are still live from the removed snapshots as well. Basically, we should be filtering the tree at every level (snapshot/manifest-list, manifest, files) before moving on to the next one.

  1. Open the old metadata file and find snapshots that are no longer in the current metadata
  2. Create a DF of the manifests table from all the expired snapshots
  3. Create a DF of the manifests table from all the current snapshots
  4. Remove any manifests from the expired set that are currently in the table (all files are still referenced)
  5. Remove any manifests from the current set that have no EXISTING files (cannot contain old files). Optional because files removed and re-added would not be caught.
  6. Transform the expired manifests DF to expired data files by reading each manifest
  7. Transform the current manifests DF to current data files by reading each manifest
  8. Remove any files from the expired data file set that are in the current data file set

Does that sound right? We may not want to do #6, but there is an opportunity to cut down on the number of manifests we read there by looking at the manifest file metadata.
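For concreteness, a condensed sketch of that pipeline (Spark Scala). Here `expiredSnapshotIds`, `currentSnapshotIds`, and the `snapshot-ids` read option are the assumptions carried over from the earlier sketches, and `readDataFiles` is a placeholder for the distributed manifest-reading step in 6 and 7:

```scala
import org.apache.spark.sql.DataFrame

// Steps 2-3: manifests reachable from the expired vs. the current snapshots.
def manifestsOf(snapshotIds: Seq[Long]): DataFrame =
  spark.read
    .format("iceberg")
    .option("snapshot-ids", snapshotIds.mkString(",")) // hypothetical option
    .load("db.table.all_manifests")
    .select("path")

val expiredManifests = manifestsOf(expiredSnapshotIds)
val currentManifests = manifestsOf(currentSnapshotIds)

// Step 4: a manifest still referenced by a live snapshot cannot yield deletable files.
val candidateManifests = expiredManifests.except(currentManifests)

// Steps 6-7: placeholder for a distributed expansion of manifest paths into the
// data files they reference (e.g. a mapPartitions that opens each manifest).
def readDataFiles(manifests: DataFrame): DataFrame = ???

// Step 8: only files unreachable from every current manifest may be deleted.
val candidateFiles = readDataFiles(candidateManifests)
  .except(readDataFiles(currentManifests))
```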

@aokolnychyi (Contributor)

@rdblue, yeah, that's what I am thinking too. Do you refer to point 5 or 6 as optional? I guess 5. The current code reads only live entries, so filtering out manifests without live files seems safe. What's the use case with re-added files? Won't that be covered too, since the snapshot with the re-added files will also be considered?

@rdblue (Contributor) commented May 18, 2022

Do you refer to point 5 or 6 as optional?

You're right: it's 5 that is optional. My mistake.

@RussellSpitzer (Member)

The manifest-skipping idea sounds good to me; it basically lets us cut down on the "except" cost as well.

@kbendick (Contributor)

Thanks for tagging me, Anton.

I think this is a good idea as well.

I won't repeat what others have said, but I have some questions and possibly a few additional ideas; I need to go through some theoretical cases on paper first.

One potential problem with that is that we will load the manifest list for every expired snapshot on the driver, which can become a bottleneck if we expire a lot of snapshots. I've seen such cases.

Two thoughts:

  1. We should add an event/metric describing this replanning work. It could be used as a signal to perform table maintenance.
  2. We might be able to track a metric to determine whether we should do this initial replanning work in a distributed manner.

@szehon-ho (Member, Author) commented Jul 6, 2022

Hi @aokolnychyi @rdblue, an update on this discussion (see the main changes in #3457). I implemented and played around with both optimizations for calculating the delete candidate files (skipping current snapshot IDs, and skipping current manifests). But I find that only this optimization makes a positive difference:

all_manifests table, reference_snapshot_id in set(expired snapshot ids)

The result is the same as what I saw initially: it reduces the Spark job time by 30-40% in the case of expiring 1 snapshot out of many.

I found that in my scenario (expiring 1 snapshot), skipping the valid manifests actually makes performance worse. As each manifest in the all_manifests table is now marked with a snapshot ID, the first filter already filters out any currently reachable manifests. So all the second optimization does is add one more Spark read of the all_manifests table and a Spark join, decreasing performance in this scenario.

EDIT: I was thinking there is one scenario where skipping the current manifests would help: the snapshots to expire point to a lot of data files that won't be deleted, because the manifests they point to are still live in the retained set of snapshots. But I still feel the cost of the extra Spark join may not be worth it in all cases. In the skip-valid-snapshots case, we get a free in-memory filter pushdown without launching a Spark job, whereas with the way the current code reads all the manifests, it has to be an additional Spark join.
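For reference, the one optimization that paid off is essentially a predicate on the all_manifests table's reference_snapshot_id column. A minimal sketch (Spark Scala, with `expiredSnapshotIds` as assumed above):

```scala
import org.apache.spark.sql.functions.col

// Keep only rows of all_manifests whose reference_snapshot_id belongs to an
// expired snapshot. This predicate can be pushed into the metadata-table scan
// planning, so it costs no extra Spark job or join.
val expiredManifests = spark.read
  .format("iceberg")
  .load("db.table.all_manifests")
  .filter(col("reference_snapshot_id").isin(expiredSnapshotIds: _*))
```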

@github-actions bot commented Aug 9, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions bot added the stale label Aug 9, 2024
@github-actions bot

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Aug 17, 2024
