Gets snapshot storages before calling verify_accounts_hash() #1202
Merged
brooksprumo merged 1 commit on May 7, 2024
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1202 +/- ##
=========================================
- Coverage 82.1% 82.1% -0.1%
=========================================
Files 886 886
Lines 236417 236442 +25
=========================================
- Hits 194266 194256 -10
- Misses 42151 42186 +35
brooksprumo
commented
May 6, 2024
Comment on lines +5623 to +5630
// The snapshot storages must be captured *before* starting the background verification.
// Otherwise, it is possible that a delayed call to `get_snapshot_storages()` will *not*
// get the correct storages required to calculate and verify the accounts hashes.
let snapshot_storages = self
    .rc
    .accounts
    .accounts_db
    .get_snapshot_storages(RangeFull);
Author
Get the snapshot storages here, before starting the background thread to verify the account hashes.
  assert_matches!(
-     db.verify_accounts_hash_and_lamports(some_slot, 1, None, config.clone()),
+     db.verify_accounts_hash_and_lamports_for_tests(some_slot, 1, config.clone()),
Author
I missed this one in the previous PR.
));
assert!(accounts
    .accounts_db
    .verify_accounts_hash_and_lamports_for_tests(
Author
The benchmark needed to be updated too.
Problem
On pop256, one of the nodes kept failing accounts hash verification. The root cause was that the incremental accounts hash calculation was getting the wrong snapshot storages.
Accounts hash verification happens in the background, so getting the snapshot storages is concurrent with normal node activity. Verification also checks the full accounts hash first, so by the time the incremental accounts hash is checked, many minutes have passed on pop256, which has over 5 billion accounts.
Getting the snapshot storages happens lazily, right before calculating each accounts hash. Thus, the incremental accounts hash (IAH) calculation doesn't get its snapshot storages until the node has been running for a while and `clean` has run many times. In this case, some of the required storages had already been marked as dead, so the IAH did not get all the storages it needed for the calculation.
We need to get the snapshot storages before calling `verify`/before the foreground processing begins, to ensure we have all the correct storages.
Summary of Changes
Pass in the snapshot storages as a function parameter to `verify_accounts_hash_and_lamports()`. This fixes the accounts hash mismatches seen during startup verification, which were caused by storages getting cleaned away prematurely.
This change fixed the issue seen on pop256.
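To illustrate the ordering problem described above, here is a minimal, self-contained sketch. It is not the actual accounts-db code: `ToyAccountsDb`, `Storage`, `verify_lazy`, and `verify_eager` are hypothetical stand-ins, and `clean` is modeled as simply dropping storages at or below a given slot. The sketch only shows why capturing the storages before starting background work gives the verifier a stable view, while fetching them later does not.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for an account storage entry (not the real type).
#[derive(Clone, Debug, PartialEq)]
struct Storage {
    slot: u64,
}

/// Hypothetical stand-in for the accounts database.
#[derive(Default)]
struct ToyAccountsDb {
    storages: Mutex<Vec<Storage>>,
}

impl ToyAccountsDb {
    /// Returns a copy of whatever storages are alive at the time of the call.
    fn get_snapshot_storages(&self) -> Vec<Storage> {
        self.storages.lock().unwrap().clone()
    }

    /// Models `clean` removing dead storages at or below `max_slot`.
    fn clean(&self, max_slot: u64) {
        self.storages.lock().unwrap().retain(|s| s.slot > max_slot);
    }
}

/// Lazy variant: the background thread fetches the storages itself,
/// possibly long after `clean` has run.
fn verify_lazy(db: Arc<ToyAccountsDb>) -> thread::JoinHandle<usize> {
    thread::spawn(move || db.get_snapshot_storages().len())
}

/// Eager variant: the caller captures the storages up front and passes
/// them in as a parameter, so later cleaning cannot affect them.
fn verify_eager(storages: Vec<Storage>) -> thread::JoinHandle<usize> {
    thread::spawn(move || storages.len())
}

/// Returns (storages seen by the eager verifier, storages seen by the lazy one).
fn demo() -> (usize, usize) {
    let db = Arc::new(ToyAccountsDb::default());
    *db.storages.lock().unwrap() = (0..4).map(|slot| Storage { slot }).collect();

    // Eager: capture the storages now, before any background work starts.
    let captured = db.get_snapshot_storages();

    // Normal node activity: `clean` removes the storages for slots 0 and 1.
    db.clean(1);

    let eager = verify_eager(captured).join().unwrap();
    let lazy = verify_lazy(Arc::clone(&db)).join().unwrap();
    (eager, lazy)
}

fn main() {
    let (eager, lazy) = demo();
    println!("eager verifier saw {eager} storages, lazy verifier saw {lazy}");
}
```

In this toy model the eager verifier still sees all four storages, while the lazy one only sees the two that survived `clean`. That mirrors the mismatch described in the Problem section and the reason for passing the storages in as a parameter.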