
v2.1: Marks old storages as dirty and uncleaned in clean_accounts() (backport of #3737) #3748

Merged
brooksprumo merged 2 commits into v2.1 from mergify/bp/v2.1/pr-3737 on Nov 25, 2024

@mergify mergify Bot commented Nov 22, 2024

Problem

Copied from #3702

We do not clean up old storages.

More context: when calculating a full accounts hash, we call mark_old_slots_as_dirty() to ensure we do not miss cleaning up really old storages (i.e. ones that are more than an epoch old). But when skipping rewrites is enabled, we don't want to clean up those old storages, as they'll intentionally be treated as ancient append vecs. So inside mark_old_slots_as_dirty() we only conditionally mark old slots as dirty, based on the value of ancient_append_vec_offset, which should be None unless ancient append vecs are enabled.

Unfortunately, on normally running validators, we end up never marking old slots as dirty, because the ancient append vec offset is always Some. And thus we never clean up old storages.
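
As a rough illustration of the problem described above, the old gate can be reduced to a check on the option. This is a minimal sketch, not the real function: `old_gate_marks_old_slots_dirty` is a hypothetical helper, and the `Option<i64>` type for the offset is an assumption.

```rust
/// Simplified stand-in for the old gating logic inside
/// mark_old_slots_as_dirty() (hypothetical helper; the offset type is assumed).
/// Old slots were only marked dirty when the offset was `None`.
fn old_gate_marks_old_slots_dirty(ancient_append_vec_offset: Option<i64>) -> bool {
    ancient_append_vec_offset.is_none()
}
```

With the offset always `Some` on a running validator, this gate never opens, so no old slot is ever marked dirty.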

Summary of Changes

Mark old storages as dirty, and add to the uncleaned roots list in clean_accounts().

We still check if ancient append vecs are enabled, but not with the ancient_append_vec_offset. Instead we look at the skipping rewrites feature gate and the cli arg.

By moving this marking into clean_accounts(), we also decouple it from accounts hash calculation, which is not necessary anymore. This also removes behavioral differences based on whether snapshots are enabled.
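
The marking step described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in this PR: `ToyAccountsDb`, its fields, and `mark_old_storages` are hypothetical, and the exact boundary for "older than an epoch" is assumed. Only the `OldStoragesPolicy` name and its `Clean`/`Leave` variants come from the review discussion.

```rust
use std::collections::BTreeSet;

type Slot = u64;

/// Policy named in the review discussion, with its two variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OldStoragesPolicy {
    Clean,
    Leave,
}

/// Toy stand-in for the accounts database (hypothetical, heavily simplified).
#[derive(Default)]
struct ToyAccountsDb {
    /// One entry per slot that still has a storage file.
    storages: BTreeSet<Slot>,
    /// Slots whose storages should be revisited by clean.
    dirty_slots: BTreeSet<Slot>,
    /// Roots that clean has not processed yet.
    uncleaned_roots: BTreeSet<Slot>,
}

impl ToyAccountsDb {
    /// Marks every storage more than one epoch older than `max_root` as dirty
    /// and uncleaned, unless the policy says to leave old storages alone.
    /// Returns how many storages were marked.
    fn mark_old_storages(
        &mut self,
        max_root: Slot,
        slots_per_epoch: u64,
        policy: OldStoragesPolicy,
    ) -> usize {
        if policy == OldStoragesPolicy::Leave {
            return 0;
        }
        // Assumed boundary: slots strictly below this value count as "old".
        let oldest_kept = max_root.saturating_sub(slots_per_epoch);
        let old: Vec<Slot> = self.storages.range(..oldest_kept).copied().collect();
        for &slot in &old {
            self.dirty_slots.insert(slot);
            self.uncleaned_roots.insert(slot);
        }
        old.len()
    }
}
```

Because the policy is passed in as a parameter, the marking itself no longer depends on accounts hash calculation or on whether snapshots are enabled.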

Justification to Backport

Without this fix, nodes may never clean up old account storage files, leading to eventual crashes due to running out of file descriptors/mmaps. There are also general performance regressions that occur as these old account storage files are unexpectedly kept around forever.

Additional Testing

I started up a node running this PR, and used a snapshot containing over 800k account storage files. The node was quickly able to remove all the old storage files and resume normal behavior.

Here's a graph of the node's count of storages. It starts around 850k and quickly drops to the correct ~432k:
[Image: Screenshot 2024-11-22 at 8 11 08 PM]


This is an automatic backport of pull request #3737 done by [Mergify](https://mergify.com).

(cherry picked from commit 31742ca)

# Conflicts:
#	accounts-db/src/accounts_db/tests.rs
@mergify mergify Bot requested a review from a team as a code owner November 22, 2024 18:04
@mergify mergify Bot added the conflicts label Nov 22, 2024

mergify Bot commented Nov 22, 2024

Cherry-pick of 31742ca has failed:

On branch mergify/bp/v2.1/pr-3737
Your branch is up to date with 'origin/v2.1'.

You are currently cherry-picking commit 31742ca61e.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   accounts-db/src/accounts_db.rs
	modified:   runtime/src/bank.rs

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   accounts-db/src/accounts_db/tests.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@HaoranYi HaoranYi left a comment

Lgtm

@jeffwashington jeffwashington left a comment

lgtm

@bw-solana bw-solana left a comment

Looks reasonable to me.

Although it's hard for me to keep all these cases in my head at once 🫠 :

  • Clean old accounts based on ancient append vec offset for tests
  • some tests clean old accounts based on passing OldStoragesPolicy directly
  • Don't clean old accounts during shrink at startup
  • Clean old accounts if ancient storages are disabled (determined by feature gate + special accountsdb flag)

epoch_schedule,
// Leave any old storages alone for now. Once the validator is running
// normal, calls to clean_accounts() will have the correct policy based
// on if ancient storages are enabled or not.

is it because we don't know what the right policy is at this point?

Correct. We don't have a Bank instance to query. It's not wrong to use Leave here, but it could potentially be wrong to use Clean.
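
The policy choice discussed in this thread could be sketched as below. This is only an illustration under assumptions: `BankFlags`, its fields, and `choose_policy` are hypothetical names, and treating either the feature gate or the CLI arg as enough to enable ancient storages is an assumption not spelled out in the discussion.

```rust
/// Policy named in the review discussion, with its two variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OldStoragesPolicy {
    Clean,
    Leave,
}

/// Hypothetical stand-in for the state a Bank would be queried for.
struct BankFlags {
    /// Whether the skipping rewrites feature gate is active.
    skip_rewrites_feature_active: bool,
    /// Whether the accounts-db CLI arg for skipping rewrites was passed.
    skip_rewrites_cli_arg: bool,
}

/// Picks a policy. With no Bank to query (e.g. shrink at startup), fall back
/// to `Leave`, which is never wrong. Otherwise clean old storages only when
/// ancient storages are disabled (assumed: neither flag is set).
fn choose_policy(bank: Option<&BankFlags>) -> OldStoragesPolicy {
    match bank {
        None => OldStoragesPolicy::Leave,
        Some(flags) if flags.skip_rewrites_feature_active || flags.skip_rewrites_cli_arg => {
            OldStoragesPolicy::Leave
        }
        Some(_) => OldStoragesPolicy::Clean,
    }
}
```

Defaulting to `Leave` when nothing is known keeps the startup path safe: at worst, old storages stay around until a later clean_accounts() call runs with the correct policy.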

@brooksprumo

> Although it's hard for me to keep all these cases in my head at once 🫠 :

Yeah... it's a lot... Hence the difficulty in stabilizing skipping rewrites 😬

> • Clean old accounts based on ancient append vec offset for tests

Ancient append vecs will always use Leave. Once skipping rewrites is enabled, a lot of the complexity will go away.

> • some tests clean old accounts based on passing OldStoragesPolicy directly

Most tests don't have more than 432,000 slots+storages, so this value is not really an issue. Using Leave is the safest here.

> • Don't clean old accounts during shrink at startup

Yep... startup is special. Unfortunately we've added yet another special case...

> • Clean old accounts if ancient storages are disabled (determined by feature gate + special accountsdb flag)

This is the majority/common case. Getting this one right is what matters most.

@brooksprumo brooksprumo merged commit 970606e into v2.1 Nov 25, 2024
@brooksprumo brooksprumo deleted the mergify/bp/v2.1/pr-3737 branch November 25, 2024 20:27