Skip to content

Do not sanitize append vecs at startup#6552

Merged
brooksprumo merged 1 commit intoanza-xyz:masterfrom
brooksprumo:startup/no-storage-scan-at-startup
Jun 13, 2025
Merged

Do not sanitize append vecs at startup#6552
brooksprumo merged 1 commit intoanza-xyz:masterfrom
brooksprumo:startup/no-storage-scan-at-startup

Conversation

@brooksprumo
Copy link
Copy Markdown

@brooksprumo brooksprumo commented Jun 12, 2025

Problem

At startup from a snapshot, we must rebuild/reconstruct all the AppendVecs. This is basically relinking all the storage files on disk to the AccountStorageEntrys that were serialized into the bank snapshot (aka snapshot manifest). We use the storage file paths from the snapshot manifest to reconstruct the AppendVecs.

We also do some sanitization of the storage files when reconstructing the new AppendVecs. One step is to sanitize each account in the storage file. This involves scanning (i.e. reading) the whole file, and checking every single account to make sure the accounts are valid. Scanning every account in every storage file is rather time consuming. On mainnet-beta, we now have over 900 million accounts. Doing this sanitization step takes over 100 seconds.

An observation is that we also read all the accounts/storages two other times as part of startup. One is for index generation, and the other is for startup accounts verification. If the storages are wrong/invalid, then one (or both) of these other tasks will catch it. Thus, we can avoid the sanitization step during startup reconstruction.

Speeding up reconstructing the storages, speeds up startup. And if startup is faster, that means validators begin replaying faster. If validators begin replaying faster, they catch up faster, AND that also means there are fewer blocks they have to repair (vs getting through turbine).

Summary of Changes

Skip AppendVec accounts sanitization during startup when reconstructing storages.

Testing against mnb, I saw times drop from ~100-120 seconds to ~2 seconds for reconstructing storages at startup.

@brooksprumo brooksprumo self-assigned this Jun 12, 2025
@brooksprumo brooksprumo marked this pull request as ready for review June 12, 2025 21:50
/// This version of `new()` may only be called when reconstructing storages as part of startup.
/// It trusts the snapshot's value for `current_len`, and relies on later index generation or
/// accounts verification to ensure it is valid.
pub fn new_for_startup(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for new_from_file compared with new_from_startup?

I understand the difference between the two, but when will we want to actually use new_from_file instead?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, new_from_file() is now only called in tests, benches, and store-tool.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason it isn't marked DCOU?

Copy link
Copy Markdown
Author

@brooksprumo brooksprumo Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably can be. I wouldn't want to do that in this PR though.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? It seems relevant to this PR: You are replacing the function with a new implementation and deprecating the old one for non-test usage.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not deprecating the old one though. It may currently be called only by dcou-stuff, but should we remove/deprecate/make-dcou it? I'm not sure; it seems like a normal/useful constructor, esp. since it is safe and can be called anywhere. I'd personally rather leave it for now. Maybe we revisit that? But that's why I don't want it within this PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(drive by from bp pr) +1 deprecation or dcou

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #6894 for DCOU.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.

Project coverage is 82.8%. Comparing base (fd8bc3a) to head (6ea4f26).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6552   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         849      849           
  Lines      379166   379187   +21     
=======================================
+ Hits       314064   314088   +24     
+ Misses      65102    65099    -3     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

storage_access: StorageAccess,
) -> Result<Arc<AccountStorageEntry>, SnapshotError> {
let (accounts_file, _num_accounts) =
AccountsFile::new_from_file(append_vec_path, current_len, storage_access)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so new_from_file is used for only tests now?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DCOU-only, yeah: #6552 (comment)

@brooksprumo brooksprumo requested review from HaoranYi and roryharr June 13, 2025 13:31
Copy link
Copy Markdown

@jeffwashington jeffwashington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown

@HaoranYi HaoranYi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

@brooksprumo brooksprumo merged commit 3804f39 into anza-xyz:master Jun 13, 2025
28 checks passed
@brooksprumo brooksprumo deleted the startup/no-storage-scan-at-startup branch June 13, 2025 17:37
@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 13, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Jun 13, 2025
(cherry picked from commit 3804f39)

# Conflicts:
#	runtime/src/serde_snapshot.rs
brooksprumo added a commit that referenced this pull request Jun 13, 2025
brooksprumo added a commit that referenced this pull request Jun 16, 2025
Do not sanitize append vecs at startup (#6552)

(cherry picked from commit 3804f39)

Co-authored-by: Brooks <brooks@anza.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants