Serializes block_id into snapshot#11355
Conversation
| feature = "frozen-abi", | ||
| derive(AbiExample), | ||
| frozen_abi(digest = "7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE") | ||
| frozen_abi(digest = "ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb") |
There was a problem hiding this comment.
Digest changed because we added block_id.
❯ diff -u ./*7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE* ./*ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb*
--- ./bank__serde_snapshot__tests__test_bank_serialize__BankAbiTestWrapper_frozen_abi__test_api_digest_7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE 2026-03-17 16:41:50
+++ ./bank__serde_snapshot__tests__test_bank_serialize__BankAbiTestWrapper_frozen_abi__test_api_digest_ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb 2026-03-17 16:40:05
@@ -1275,7 +1275,7 @@
element u8
primitive u8
element solana_runtime::serde_snapshot::ExtraFieldsToSerialize
- struct ExtraFieldsToSerialize (fields = 5)
+ struct ExtraFieldsToSerialize (fields = 6)
field lamports_per_signature: u64
primitive u64
field unused_incremental_snapshot_persistence: core::option::Option<solana_runtime::serde_snapshot::UnusedIncrementalSnapshotPersistence>
@@ -4480,3 +4480,73 @@
primitive u16
element serde_with::ser::SerializeAsWrap<'_, u16, serde_with::Same>
primitive u16
+ field block_id: core::option::Option<solana_hash::Hash>
+ enum Option (variants = 2)
+ variant(0) None (unit)
+ variant(1) Some(solana_hash::Hash) (newtype)
+ struct Hash([u8; 32]) (newtype)
+ tuple (elements = 32)
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8
+ element u8
+ primitive u8| pub unused_epoch_accounts_hash: Option<Hash>, | ||
| pub versioned_epoch_stakes: HashMap<u64, VersionedEpochStakes>, | ||
| pub accounts_lt_hash: Option<SerdeAccountsLtHash>, | ||
| pub block_id: Option<Hash>, |
There was a problem hiding this comment.
This is the main change. We serialize block_id into the snapshot.
3c58c46 to
7da8a7f
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #11355 +/- ##
=========================================
- Coverage 83.3% 83.2% -0.1%
=========================================
Files 842 842
Lines 319350 319445 +95
=========================================
+ Hits 266022 266082 +60
- Misses 53328 53363 +35 🚀 New features to boost your workflow:
|
|
@roryharr - requesting your review on the snapshot parts |
7da8a7f to
0d7385e
Compare
| accounts_data_len: self.load_accounts_data_size(), | ||
| versioned_epoch_stakes: self.epoch_stakes.clone(), | ||
| accounts_lt_hash: self.accounts_lt_hash.lock().unwrap().clone(), | ||
| block_id: self.block_id().expect("block id must be set"), |
There was a problem hiding this comment.
@AshwinSekar - here we require that block_id is set in order to take a snapshot.
Codex told me this may not be safe though. Can you confirm, please? Here's its review comments:
Avoid requiring block_id on banks that still replay with None
On clusters where some peers still emit the legacy shred format, Blockstore::check_last_fec_set_and_get_block_id() legitimately returns Ok(None), and replay stores that None on rooted non-leader banks (the existing comments in Consensus::record_bank_vote call this out). This new expect means the accounts background service will panic as soon as one of those banks is selected for a full or incremental snapshot, because the normal validator snapshot path never backfills a synthetic ID the way the warp/ledger-tool paths now do.
There was a problem hiding this comment.
Legacy shreds have been removed on all live clusters and we explicitly drop such shreds on ingest.
We still have some code that refers to legacy shreds, but those are in the process of being ripped out.
Also if Blockstore::check_last_fec_set_and_get_block_id() fails we return an Err and the slot is marked dead meaning it can't be rooted and snapshotted in the first place.
| /// Note this fn will also freeze `bank`. | ||
| /// | ||
| /// Only to be called from dev contexts. | ||
| /// Couldn't make the fn actually DCOU, since it needs to be called by Validator::new(). |
There was a problem hiding this comment.
nit: can say it's called by Validator::new() when warping a slot
This functionality is only used for tests, (warp_slot doesn't have a cli flag and is only set in tests / test-validator), and can't really be used on a live cluster unless everyone does it.
maybe someday it could be put behind DCOU.
d318920 to
0de018b
Compare
|
Rebased onto tip of master to pull in #11351, which fixes failures in test_refresh_vote_eviction. |
| /// Note this fn will also freeze `bank`. | ||
| /// | ||
| /// Only to be called from dev contexts. | ||
| /// Couldn't make the fn actually DCOU, since it is called by |
There was a problem hiding this comment.
So is the issue here that warping a slot should in fact be dcou, but it's not due to how our crates are layed out?
Is this something that can be fixed by moving the test-validator to dev-bins?
There was a problem hiding this comment.
So is the issue here that warping a slot should in fact be dcou, but it's not due to how our crates are layed out?
I think the only use for warping is for dev context stuff, yes? I reserve the right to be wrong here though 😸
Is this something that can be fixed by moving the test-validator to dev-bins?
Maybe? I dunno. Just gotta put everything that calls calculate_and_set_block_id_for_dcou() into a DCOU block. Only moving test-validator doesn't do that, but may enable moving everything else. I haven't looked into this though.
There was a problem hiding this comment.
This seems like trying to put a bandaid over the real issue: That warp_slot is DCOU and isn't documented as so because no ones fixed it.
So it should be fixed first, which might provide some other benefits as well.
Or it's not actually DCOU and the _for_dcou should be removed.
There was a problem hiding this comment.
Yeah, might be able to refactor code to mark warping as dcou. IMO seems quite a large endeavor to block this PR on though. I can rename calculate_and_set_block_id_for_dcou() if that is sufficient.
There was a problem hiding this comment.
I guess it really depends how time pressing this feature is.
There was a problem hiding this comment.
This PR doesn't change any of the semantics around warping. I don't understand why it should be gated on solana-test-validator.
There was a problem hiding this comment.
From the other conversation, it sounds like warping in the validator can be removed entirely. Would that change this PR at all?
There was a problem hiding this comment.
If validator.rs maybe_warp_slot() can be removed (or put into a DCOU block), then I think calculate_and_set_block_id_for_dcou() should be able to be marked with DCOU too.
There was a problem hiding this comment.
Turning into a mess. Merge this and solve the DCOU issue later
0de018b to
92f033f
Compare
|
Rebased and force-pushed to resolve merge conflicts with master. |
AshwinSekar
left a comment
There was a problem hiding this comment.
good from my side, there's some follow up discussion on discord to remove warp slot from validator.rs which would let us make this helper a proper DCOU in the future
gregcusack
left a comment
There was a problem hiding this comment.
i think there is a bug in this PR. We have a testsuite testing tvu on a locally running validator that panicked once this PR landed:
thread 'solAcctsBgSvc' (4062564) panicked at runtime/src/bank.rs:2132:39:
block id must be set
Not super familiar with the bank/snapshot side of things so been using Codex to help me get a lead as to what is going on.
Re Brooks/Ashwin's convo here: https://github.com/anza-xyz/agave/pull/11355/changes#r2954328517
I think Ashwin’s point only rules out the old legacy-shred/non-leader case. It doesn’t prove the case that every snapshot-eligible bank has block_id set.
The current code still treats block_id as legitimately optional for leader banks. See:
Line 944 in fb1208f
None for leader blocks in core/src/replay_stage.rs here: agave/core/src/replay_stage.rs
Lines 3525 to 3558 in fb1208f
The snapshot path then reaches Bank::get_fields_to_serialize() from solAcctsBgSvc here:
agave/runtime/src/snapshot_package.rs
Line 45 in fb1208f
expect panics.
So it doesn't appear the expect is safe.
|
@gregcusack Yeah, that's the same code I was wondering about too. I was under the impression that |
|
I confirm that I experience problems after picking this change. So I'm running locally with this fix: #11631 |
|
PR #11613 is supposed to be the fix. |
Problem
block_idis not serialized into the snapshot.Deserialization support was added in #9644, which is in v4.0. It is now safe to serialize block id here, in v4.1.
This is for SIMD-3331.
Summary of Changes
Serialize block_id into snapshot.
Note that the block_id is required to Some when writing into the snapshot now. As such, many tests needed to be updated to set the bank's block_id field before taking a snapshot. If preferred, I could pull out the code that sets block_id into a separate PR. I didn't do that here since setting block_id isn't necessary without the snapshot changes.
Additional Testing
I ran this commit on a dev box, created snapshots, then restarted on v4.0. The node was successfully able to load the snapshot and startup.
Footnotes
https://github.com/solana-foundation/solana-improvement-documents/blob/main/proposals/0333-snapshot-include-block-id.md ↩