startup: purge shreds w/ wrong shred version inclusive of WFSM slot by AshwinSekar · Pull Request #12343 · anza-xyz/agave

AshwinSekar · 2026-05-07T18:10:28Z

Problem

There is an issue with how the SIMD-0340 chained block id validation interacts with a WFSM (wait-for-supermajority) cluster restart from a snapshot.

When performing such a restart on the alpenglow community cluster, we ran into the following warning during replay:

WARN  solana_ledger::blockstore_processor] Chained merkle root mismatch for slot 1 (parent 0): child chains to 11111111111111111111111111111111, but parent block ID is CQ7oqugg4WKFENMCLeSAREEQ7K3F1UiJQq7PJHM1ygh1

As a recap, SIMD-0340 verifies that the chained merkle root specified at the beginning of block S corresponds to the block id of parent(S). This check is currently not enforced for the first block after a snapshot, as shreds from the snapshot slot might not be available (or might not exist at all in the case of an artificial snapshot), and broadcast defaults to using 111..11 in this case.
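A minimal sketch of the check described above, not the actual blockstore_processor code. `BlockId`, `ChainCheck` and `check_chained_merkle_root` are hypothetical names, and modelling "parent block id unavailable" as `None` is an assumption made for illustration. The all-zero hash is what renders as 111..11 in base58.

```rust
/// Hypothetical stand-in for a 32-byte hash; the real code uses Solana's hash type.
type BlockId = [u8; 32];

/// Outcome of comparing a block's chained merkle root with its parent's block id.
#[derive(Debug, PartialEq, Eq)]
enum ChainCheck {
    /// No parent block id is known locally (assumed here to be the
    /// "first block after a snapshot" case), so nothing is compared.
    Skipped,
    /// The child chains to the parent's block id.
    Ok,
    /// The child chains to something else; this corresponds to the
    /// "Chained merkle root mismatch" warning in the log above.
    Mismatch { chained: BlockId, parent: BlockId },
}

/// Compare the chained merkle root at the start of block S with the
/// block id of parent(S), if the parent's block id is known.
fn check_chained_merkle_root(chained: BlockId, parent_block_id: Option<BlockId>) -> ChainCheck {
    match parent_block_id {
        None => ChainCheck::Skipped,
        Some(parent) if parent == chained => ChainCheck::Ok,
        Some(parent) => ChainCheck::Mismatch { chained, parent },
    }
}
```

In this model, a node with no parent block id skips the check, while a node that does hold one compares it against the default value the leader chained to and reports a mismatch.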

In this setup we have the following:

Slot 0 does have shreds from genesis on some nodes (Tim's and my bootstrap nodes):

Slot 0 (root)
   num_shreds: 32, parent_slot: Some(0), next_slots: [1, 174], num_entries: 64, is_full: true

The block id of this slot is CQ7oqugg4WKFENMCLeSAREEQ7K3F1UiJQq7PJHM1ygh1, the merkle root of the final shred.

Follower nodes do not have shreds for genesis:

 Slot 0 (root)
num_shreds: 0, parent_slot: None, next_slots: [], num_entries: 0, is_full: false

Instead, everyone has a snapshot on slot 0, whose block id is equal to the bank hash.

block_id: Some(
    2pM9pWtQcWQY4MuRhvCtNpFjBDZMxeNyDsusY2xT8K49,
)

This block id differs because we erroneously do not set the block id on slot 0. SIMD-0333 ensures that all snapshots (even artificial ones generated by ledger-tool) have a block_id field set, so it defaults to the bank hash.

This combination leads to the leader (not a bootstrap node) building a block that chains to 111..11. Normally this would be fine, as we are restarting from a snapshot. However, Tim's and my nodes also had shreds for the genesis snapshot slot in blockstore, so we compared the chained value against CQ7oqugg4WKFENMCLeSAREEQ7K3F1UiJQq7PJHM1ygh1 and failed.

(not critical) ensure that bank 0 has the block id set.
(critical) ensure that we do not have shreds for the WFSM slot when restarting from a snapshot

We should backport this fix before enabling SIMD-0340 as any cluster restart where we have outstanding shreds for the artificial snapshot slot is at risk.

Note: in the ag community cluster this would have only forked off Tim's and my nodes; however, there was also a participant that panicked out of the gate and caused a WFSM false start.

Summary of Changes

Set the block id for bank 0 correctly
Do the startup purge of shreds inclusive of the hard forks slot, as a ledger-tool created snapshot could have a different block id than the shreds for that slot:

E.g. suppose we create a ledger-tool snapshot for an artificial bank at slot 5.
Startup will only purge shreds with an incorrect shred version from slot 6 onwards.
This means we will have conflicting shreds and snapshot for slot 5.
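The slot 5 example can be sketched with a toy blockstore. Everything here is hypothetical (a `BTreeMap` from slot to shred version standing in for the real blockstore, and the `purge_wrong_version` helper); it only illustrates how the choice of start slot decides whether the snapshot slot's old shreds survive.

```rust
use std::collections::BTreeMap;

type Slot = u64;

/// Toy blockstore: maps each slot to the shred version of the shreds stored for it.
/// Removes every slot at or after `start_slot` whose shred version differs from
/// `expected_version`, and returns the purged slots in ascending order.
fn purge_wrong_version(
    blockstore: &mut BTreeMap<Slot, u16>,
    start_slot: Slot,
    expected_version: u16,
) -> Vec<Slot> {
    // Collect first so the map is not mutated while it is being iterated.
    let to_purge: Vec<Slot> = blockstore
        .range(start_slot..)
        .filter(|(_, version)| **version != expected_version)
        .map(|(slot, _)| *slot)
        .collect();
    for slot in &to_purge {
        blockstore.remove(slot);
    }
    to_purge
}
```

With a snapshot at slot 5 and old-version shreds in slots 4 through 7, starting at `5 + 1` leaves slot 5's shreds in place, while starting at `5` removes them as well. Slot 4 is untouched in both cases.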

AshwinSekar · 2026-05-07T18:25:06Z

Confirmed that with this patch my node was able to make progress.

steviez · 2026-05-07T19:09:27Z

I'm reading through this, but can you elaborate on what you mean when you say "artificial snapshot"? What exactly is artificial about it? Taking a snapshot at any root should be allowed. Maybe you're referring to snapshots when we do stuff like feature deactivations, where we end up creating the child bank and advancing the slot by 1.

bw-solana · 2026-05-07T20:09:00Z

~~looks like you need to modify test_hard_fork_invalidates_tower to change the hard fork slot~~

steviez

> (not critical) ensure that bank 0 has the block id set.

This seems like the bug, no? If the snapshot sets the block ID for bank 0 correctly, can we avoid the change to blockstore purging altogether?

> (critical) ensure that we do not have shreds for the WFSM slot when restarting from a snapshot

As noted elsewhere, I believe this to be a non-starter. The shreds in the blockstore are valid / correct, so there should be no need to purge them.

steviez · 2026-05-07T20:06:02Z

    let maybe_cluster_restart_slot = maybe_cluster_restart_with_hard_fork(config, root_slot);
    if maybe_cluster_restart_slot.is_some() {
-        return Ok(Some(root_slot + 1));
+        return Ok(Some(root_slot));


I don't think we want to change this logic for the general case.

Suppose the cluster halts with slot S as the latest OC; slot S will be picked as the cluster restart snapshot slot. Changing this logic means this function will return Ok(Some(S)). We will then purge slot S, which means nodes will have a hole in their blockstore for slot S ... this is bad for RPC nodes.

Agree. The test_hard_fork_invalidates_tower failure is basically catching this.

bw-solana

Should the invariant be "purge only wrong-version" or "purge all" shreds for the snapshot restart slot?

I believe these should mostly be the same, but is it possible for some operator to get new version shreds in the WFSM slot and then get the wrong answer for Block ID?

bw-solana · 2026-05-07T20:46:50Z

        _ => BlockstoreProcessorError::FailedToReplayBank0,
    })?;
+
+    bank0.set_block_id(blockstore.get_block_id(bank0.slot(), migration_status)?);


Do we always expect block ID to be Some here?
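To make the question concrete, here is a toy model of the added line. `ToyBlockstore`, `ToyBank` and `process_bank_0` are hypothetical stand-ins, not the real types or signatures, and the assumption that a slot without stored shreds yields `None` is mine. Under that assumption, a follower with no genesis shreds would end up with no block id on bank 0.

```rust
use std::collections::HashMap;

type Slot = u64;
/// Hypothetical stand-in for a 32-byte hash.
type BlockId = [u8; 32];

/// Toy blockstore: only slots with a full set of shreds have a known block id.
struct ToyBlockstore {
    full_slots: HashMap<Slot, BlockId>,
}

impl ToyBlockstore {
    /// Returns Ok(None) when the slot has no block id (assumed: no shreds stored).
    fn get_block_id(&self, slot: Slot) -> Result<Option<BlockId>, String> {
        Ok(self.full_slots.get(&slot).copied())
    }
}

/// Toy bank: defaults to slot 0 with no block id.
#[derive(Default)]
struct ToyBank {
    slot: Slot,
    block_id: Option<BlockId>,
}

impl ToyBank {
    fn set_block_id(&mut self, block_id: Option<BlockId>) {
        self.block_id = block_id;
    }
}

/// Mirrors the shape of the added line: propagate lookup errors with `?`
/// and pass the Option through unchanged.
fn process_bank_0(blockstore: &ToyBlockstore) -> Result<ToyBank, String> {
    let mut bank0 = ToyBank::default();
    bank0.set_block_id(blockstore.get_block_id(bank0.slot)?);
    Ok(bank0)
}
```

In this model, a bootstrap node that holds the genesis shreds gets `Some(..)`, and a follower with an empty slot 0 gets `None`.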

AshwinSekar · 2026-05-07T21:04:33Z

> This seems like the bug, no? If the snapshot sets the block ID for bank 0 correctly, can we avoid the change to blockstore purging altogether?

This is fair, and I've fixed it so the block id is set correctly now.

> As noted elsewhere, I believe this to be a non-starter. The shreds in the blockstore are valid / correct so there should be no need to purge them

Yeah heard for the general case. I suppose I'm specifically worried for the artificial case (e.g. we deactivate a feature or change an account). The shreds for the snapshot will be present but they're not actually corresponding to the snapshot anymore.

I'm thinking this should instead be handled by the operator or ledger-tool to ensure when the artificial snapshot is happening we either use the block id from the shreds or remove the shreds prior to restart.

Either way, I agree this is the wrong way to solve it. I'll close this for now and PR just the bank 0 fix separately.

steviez · 2026-05-07T21:16:44Z

> Yeah heard for the general case. I suppose I'm specifically worried for the artificial case (e.g. we deactivate a feature or change an account). The shreds for the snapshot will be present but they're not actually corresponding to the snapshot anymore.

Yep, this bit is on my radar as part of #7287. My last comment mentions a tweak I did to get consistent devnet history, although I didn't mention that I explicitly purged the slot with ledger-tool before doing this. In any case, I'm on this bit; it has just gotten back-burnered for other stuff.

> I'm thinking this should instead be handled by the operator or ledger-tool

This is my plan. Namely, I'm going to have ledger-tool create-snapshot back up the shreds for CHILD_BANK_SLOT (similar to how the validator does) and then insert the new slot (just ticks). My last comment in the issue I linked shows the second half (insert new block) but not the first bit (back up old block).

AshwinSekar · 2026-05-07T21:44:26Z

Got it, given that plan do you think we can continue on with activation of SIMD-0340?

If we need to do a cluster restart on a live cluster where we create an artificial bank, we can tell operators to purge the artificial slot's shreds so that this validation case is not exercised.

And on master:

Fix bank 0 not having a block id: "blockstore_processor: set block id for bank 0" #12353
Change broadcast to default to the snapshot's block id: "broadcast: use block id from snapshot as the CMR for initial blocks" #12354
(future) For artificial snapshots, have ledger-tool set the block id based on the fake shreds we insert in blockstore, and back up / remove any existing shreds
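A rough sketch of the broadcast change in the second item, written without reference to the actual contents of #12354. The function names and the `Option<BlockId>` input are assumptions for illustration only.

```rust
/// Hypothetical stand-in for a 32-byte hash.
type BlockId = [u8; 32];

/// The all-zero hash, which renders as 111..11 in base58.
const DEFAULT_BLOCK_ID: BlockId = [0u8; 32];

/// Behaviour described in the problem statement: the first block built
/// after a snapshot chains to the default value.
fn initial_cmr_before() -> BlockId {
    DEFAULT_BLOCK_ID
}

/// Assumed behaviour after the change: chain to the snapshot's block id
/// when it is set, and fall back to the default value otherwise.
fn initial_cmr_after(snapshot_block_id: Option<BlockId>) -> BlockId {
    snapshot_block_id.unwrap_or(DEFAULT_BLOCK_ID)
}
```

For this to line up on nodes that still hold shreds for the snapshot slot, the snapshot's block id has to match the block id derived from those shreds, which is why the bank 0 fix is listed first.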

steviez · 2026-05-07T21:48:33Z

> Got it, given that plan do you think we can continue on with activation of SIMD-0340?

Yes

> If we need to do a cluster restart on a live cluster where we create an artificial bank, we can tell operators to purge the artificial slot's shreds so that this validation case is not exercised.

Yup, one extra ledger-tool command in the instructions

AshwinSekar force-pushed the fix-val-cmr branch from 20554ab to f45f99b Compare May 7, 2026 18:11

AshwinSekar changed the title ~~fix chained merkle root when building first block on restart~~ startup: purge shreds w/ wrong shred version inclusive of WFSM slot May 7, 2026

AshwinSekar requested review from bw-solana and steviez May 7, 2026 18:25

startup: purge shreds w/ wrong shred version inclusive of WFSM slot

4bfb820

AshwinSekar force-pushed the fix-val-cmr branch from f45f99b to 4bfb820 Compare May 7, 2026 18:33

steviez requested changes May 7, 2026

bw-solana reviewed May 7, 2026

AshwinSekar closed this May 7, 2026

AshwinSekar wants to merge 1 commit into anza-xyz:master from AshwinSekar:fix-val-cmr
