Skip to content

Serializes block_id into snapshot#11355

Merged
brooksprumo merged 1 commit intoanza-xyz:masterfrom
brooksprumo:simd333/snap/serialize
Mar 21, 2026
Merged

Serializes block_id into snapshot#11355
brooksprumo merged 1 commit intoanza-xyz:masterfrom
brooksprumo:simd333/snap/serialize

Conversation

@brooksprumo
Copy link
Copy Markdown

@brooksprumo brooksprumo commented Mar 17, 2026

Problem

block_id is not serialized into the snapshot.

Deserialization support was added in #9644, which is in v4.0. It is now safe to serialize block id here, in v4.1.

This is for SIMD-3331.

Summary of Changes

Serialize block_id into snapshot.

Note that the block_id is required to Some when writing into the snapshot now. As such, many tests needed to be updated to set the bank's block_id field before taking a snapshot. If preferred, I could pull out the code that sets block_id into a separate PR. I didn't do that here since setting block_id isn't necessary without the snapshot changes.

Additional Testing

I ran this commit on a dev box, created snapshots, then restarted on v4.0. The node was successfully able to load the snapshot and startup.

Footnotes

  1. https://github.com/solana-foundation/solana-improvement-documents/blob/main/proposals/0333-snapshot-include-block-id.md

@brooksprumo brooksprumo self-assigned this Mar 17, 2026
feature = "frozen-abi",
derive(AbiExample),
frozen_abi(digest = "7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE")
frozen_abi(digest = "ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb")
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digest changed because we added block_id.

❯ diff -u ./*7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE* ./*ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb*
--- ./bank__serde_snapshot__tests__test_bank_serialize__BankAbiTestWrapper_frozen_abi__test_api_digest_7NL9Sugo2js3PtueK7izqSwX5dGWKDMDVnjo6MirVPWE	2026-03-17 16:41:50
+++ ./bank__serde_snapshot__tests__test_bank_serialize__BankAbiTestWrapper_frozen_abi__test_api_digest_ELfWa1ABrJu9sQABjieyqnWioDaHNykbmYqsQr7yYYBb	2026-03-17 16:40:05
@@ -1275,7 +1275,7 @@
                                                 element u8
                                                     primitive u8
             element solana_runtime::serde_snapshot::ExtraFieldsToSerialize
-                struct ExtraFieldsToSerialize (fields = 5)
+                struct ExtraFieldsToSerialize (fields = 6)
                     field lamports_per_signature: u64
                         primitive u64
                     field unused_incremental_snapshot_persistence: core::option::Option<solana_runtime::serde_snapshot::UnusedIncrementalSnapshotPersistence>
@@ -4480,3 +4480,73 @@
                                             primitive u16
                                         element serde_with::ser::SerializeAsWrap<'_, u16, serde_with::Same>
                                             primitive u16
+                    field block_id: core::option::Option<solana_hash::Hash>
+                        enum Option (variants = 2)
+                            variant(0) None (unit)
+                            variant(1) Some(solana_hash::Hash) (newtype)
+                                struct Hash([u8; 32]) (newtype)
+                                    tuple (elements = 32)
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8
+                                        element u8
+                                            primitive u8

pub unused_epoch_accounts_hash: Option<Hash>,
pub versioned_epoch_stakes: HashMap<u64, VersionedEpochStakes>,
pub accounts_lt_hash: Option<SerdeAccountsLtHash>,
pub block_id: Option<Hash>,
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change. We serialize block_id into the snapshot.

@brooksprumo brooksprumo force-pushed the simd333/snap/serialize branch 3 times, most recently from 3c58c46 to 7da8a7f Compare March 18, 2026 12:32
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 95.91837% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (aa2b33e) to head (92f033f).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #11355     +/-   ##
=========================================
- Coverage    83.3%    83.2%   -0.1%     
=========================================
  Files         842      842             
  Lines      319350   319445     +95     
=========================================
+ Hits       266022   266082     +60     
- Misses      53328    53363     +35     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@brooksprumo brooksprumo marked this pull request as ready for review March 18, 2026 14:37
@brooksprumo
Copy link
Copy Markdown
Author

@roryharr - requesting your review on the snapshot parts
@AshwinSekar - requesting your review on the block_id parts, especially the logic within calculate_and_set_block_id_for_dcou().

@brooksprumo brooksprumo force-pushed the simd333/snap/serialize branch from 7da8a7f to 0d7385e Compare March 18, 2026 15:28
Comment thread runtime/src/bank.rs
accounts_data_len: self.load_accounts_data_size(),
versioned_epoch_stakes: self.epoch_stakes.clone(),
accounts_lt_hash: self.accounts_lt_hash.lock().unwrap().clone(),
block_id: self.block_id().expect("block id must be set"),
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AshwinSekar - here we require that block_id is set in order to take a snapshot.

Codex told me this may not be safe though. Can you confirm, please? Here's its review comments:

Avoid requiring block_id on banks that still replay with None

On clusters where some peers still emit the legacy shred format, Blockstore::check_last_fec_set_and_get_block_id() legitimately returns Ok(None), and replay stores that None on rooted non-leader banks (the existing comments in Consensus::record_bank_vote call this out). This new expect means the accounts background service will panic as soon as one of those banks is selected for a full or incremental snapshot, because the normal validator snapshot path never backfills a synthetic ID the way the warp/ledger-tool paths now do.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy shreds have been removed on all live clusters and we explicitly drop such shreds on ingest.
We still have some code that refers to legacy shreds, but those are in the process of being ripped out.

Also if Blockstore::check_last_fec_set_and_get_block_id() fails we return an Err and the slot is marked dead meaning it can't be rooted and snapshotted in the first place.

Copy link
Copy Markdown

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good from my side

Comment thread runtime/src/bank.rs
Comment thread runtime/src/bank.rs Outdated
/// Note this fn will also freeze `bank`.
///
/// Only to be called from dev contexts.
/// Couldn't make the fn actually DCOU, since it needs to be called by Validator::new().
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can say it's called by Validator::new() when warping a slot

This functionality is only used for tests, (warp_slot doesn't have a cli flag and is only set in tests / test-validator), and can't really be used on a live cluster unless everyone does it.

maybe someday it could be put behind DCOU.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d318920.

Comment thread runtime/src/bank.rs
@brooksprumo brooksprumo force-pushed the simd333/snap/serialize branch from d318920 to 0de018b Compare March 18, 2026 16:48
@brooksprumo
Copy link
Copy Markdown
Author

Rebased onto tip of master to pull in #11351, which fixes failures in test_refresh_vote_eviction.

@brooksprumo brooksprumo requested a review from AshwinSekar March 18, 2026 17:30
Comment thread runtime/src/bank.rs
/// Note this fn will also freeze `bank`.
///
/// Only to be called from dev contexts.
/// Couldn't make the fn actually DCOU, since it is called by
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the issue here that warping a slot should in fact be dcou, but it's not due to how our crates are layed out?

Is this something that can be fixed by moving the test-validator to dev-bins?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the issue here that warping a slot should in fact be dcou, but it's not due to how our crates are layed out?

I think the only use for warping is for dev context stuff, yes? I reserve the right to be wrong here though 😸

Is this something that can be fixed by moving the test-validator to dev-bins?

Maybe? I dunno. Just gotta put everything that calls calculate_and_set_block_id_for_dcou() into a DCOU block. Only moving test-validator doesn't do that, but may enable moving everything else. I haven't looked into this though.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like trying to put a bandaid over the real issue: That warp_slot is DCOU and isn't documented as so because no ones fixed it.

So it should be fixed first, which might provide some other benefits as well.

Or it's not actually DCOU and the _for_dcou should be removed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, might be able to refactor code to mark warping as dcou. IMO seems quite a large endeavor to block this PR on though. I can rename calculate_and_set_block_id_for_dcou() if that is sufficient.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it really depends how time pressing this feature is.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't change any of the semantics around warping. I don't understand why it should be gated on solana-test-validator.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the other conversation, it sounds like warping in the validator can be removed entirely. Would that change this PR at all?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If validator.rs maybe_warp_slot() can be removed (or put into a DCOU block), then I think calculate_and_set_block_id_for_dcou() should be able to be marked with DCOU too.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turning into a mess. Merge this and solve the DCOU issue later

@brooksprumo brooksprumo requested a review from roryharr March 18, 2026 17:56
@brooksprumo brooksprumo force-pushed the simd333/snap/serialize branch from 0de018b to 92f033f Compare March 19, 2026 13:52
@brooksprumo
Copy link
Copy Markdown
Author

Rebased and force-pushed to resolve merge conflicts with master.

Copy link
Copy Markdown

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good from my side, there's some follow up discussion on discord to remove warp slot from validator.rs which would let us make this helper a proper DCOU in the future

@brooksprumo brooksprumo added this pull request to the merge queue Mar 21, 2026
Merged via the queue into anza-xyz:master with commit d5e3f34 Mar 21, 2026
51 checks passed
@brooksprumo brooksprumo deleted the simd333/snap/serialize branch March 21, 2026 14:58
Copy link
Copy Markdown

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think there is a bug in this PR. We have a testsuite testing tvu on a locally running validator that panicked once this PR landed:

thread 'solAcctsBgSvc' (4062564) panicked at runtime/src/bank.rs:2132:39:
block id must be set

Not super familiar with the bank/snapshot side of things so been using Codex to help me get a lead as to what is going on.

Re Brooks/Ashwin's convo here: https://github.com/anza-xyz/agave/pull/11355/changes#r2954328517

I think Ashwin’s point only rules out the old legacy-shred/non-leader case. It doesn’t prove the case that every snapshot-eligible bank has block_id set.

The current code still treats block_id as legitimately optional for leader banks. See:

/// None for banks that have not yet completed replay or for leader banks as we cannot populate block_id
. Replay still stores None for leader blocks in core/src/replay_stage.rs here:

agave/core/src/replay_stage.rs

Lines 3525 to 3558 in fb1208f

let block_id = if !is_leader_block {
// If the block does not have at least DATA_SHREDS_PER_FEC_BLOCK correctly retransmitted
// shreds in the last FEC set, mark it dead. No reason to perform this check on our leader block.
match process_active_banks_context
.blockstore
.check_last_fec_set_and_get_block_id(
bank.slot(),
bank.hash(),
&bank.feature_set,
) {
Ok(block_id) => block_id,
Err(result_err) => {
let root = bank_forks.read().unwrap().root();
Self::mark_dead_slot(
&process_active_banks_context.blockstore,
bank,
root,
&result_err,
process_active_banks_context.rpc_subscriptions.as_deref(),
process_active_banks_context.slot_status_notifier.as_ref(),
progress,
duplicate_slots_to_repair,
&process_active_banks_context.ancestor_hashes_replay_update_sender,
purge_repair_slot_counter,
&mut tbft_structs,
&process_active_banks_context.replay_vote_sender,
);
continue;
}
}
} else {
None
};
bank.set_block_id(block_id);
.

The snapshot path then reaches Bank::get_fields_to_serialize() from solAcctsBgSvc here:

let bank_fields_to_serialize = bank.get_fields_to_serialize();
, where this new expect panics.

So it doesn't appear the expect is safe.

@brooksprumo
Copy link
Copy Markdown
Author

@gregcusack Yeah, that's the same code I was wondering about too.

I was under the impression that block_id must be set before it gets rooted. Is it that a leader sometimes doesn't get block_id, or that is never sets block_id?

@KirillLykov
Copy link
Copy Markdown

I confirm that I experience problems after picking this change. So I'm running locally with this fix: #11631

@brooksprumo
Copy link
Copy Markdown
Author

PR #11613 is supposed to be the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants