
repair: plumb shred resolver service, and send events from blockstore #428

Closed
AshwinSekar wants to merge 5 commits into anza-xyz:master from AshwinSekar:shred-resolver

Conversation

@AshwinSekar (Contributor) commented Aug 28, 2025

Problem

We need a controller to receive requests to repair blocks from the pool/replay and send requests to the network. It needs to be blockstore-aware so that it can send out additional repairs and reconcile invalid responses.

Additionally, we don't want to poll blockstore to see whether our requests have completed. I would like to set it up so that we perform the minimal blockstore interactions needed to find out which shreds to request, and then proceed on notifications from blockstore.

Summary of Changes

Plumb together a new service for this (open to naming suggestions); the implementation follows in the next PR. Send events from blockstore for use as feedback in this service.
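For concreteness, a rough sketch of the intended plumbing (assuming crossbeam channels; the ShredEvent name and its fields come from this PR's diff, everything else here is illustrative, not the PR's actual code):

    use crossbeam_channel::Receiver;
    use std::thread::{self, JoinHandle};

    // Illustrative subset; the PR's enum carries more variants and fields.
    pub enum ShredEvent {
        CompletedFECSet {
            slot: u64,
            fec_set_index: u32,
            is_last_in_slot: bool,
        },
    }

    pub struct ShredResolverService {
        thread_hdl: JoinHandle<()>,
    }

    impl ShredResolverService {
        pub fn new(event_receiver: Receiver<Vec<ShredEvent>>) -> Self {
            let thread_hdl = thread::Builder::new()
                .name("solShredResolve".to_string())
                .spawn(move || loop {
                    // Block on blockstore notifications instead of polling it.
                    let Ok(_events) = event_receiver.recv() else {
                        // Sender dropped: blockstore is shutting down.
                        break;
                    };
                    // Event handling lands in the follow-up PR.
                })
                .unwrap();
            Self { thread_hdl }
        }

        pub fn join(self) -> std::thread::Result<()> {
            self.thread_hdl.join()
        }
    }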

@AshwinSekar AshwinSekar force-pushed the shred-resolver branch 4 times, most recently from dffe763 to b0eb8f0 on September 4, 2025 at 15:30
@AshwinSekar AshwinSekar marked this pull request as ready for review September 4, 2025 15:36
@bw-solana bw-solana requested a review from cpubot September 4, 2025 17:46
@bw-solana (Contributor):

Adding @cpubot just for awareness. He was going to look into some of the shred/blockstore --> replay flow performance

@@ -0,0 +1,51 @@
//! The shred resolver service listens to events emitted by
Contributor:

maybe duplicate_shred_resolver_service

Contributor Author:

the idea was that this should be a replacement for repair_service.rs, so it sends out normal repair as well as handling duplicate scenarios.

wanted to avoid calling it ag_repair_service 😅

Contributor:

we can just drag and drop it into repair_service when the time comes then

Comment on lines +40 to +43
let Ok(_events) = event_receiver.recv() else {
    break;
};
});
Contributor:

do we want to be consistent with the error handling in block_creation_loop and event_handler:

match record_receiver.recv_timeout(Duration::from_millis(400)) {
    Ok(record) => {
        if record
            .sender
            .send(poh_recorder.write().unwrap().record(
                record.slot,
                record.mixins,
                record.transaction_batches,
            ))
            .is_err()
        {
            panic!("Error returning mixin hashes");
        }
    }
    Err(RecvTimeoutError::Disconnected) => {
        info!("Record receiver disconnected");
        return;
    }
    Err(RecvTimeoutError::Timeout) => (),

Contributor Author:

yeah definitely, I'll add proper error handling when I fill in the impl in the next PR
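For reference, the recv_timeout pattern applied here might look something like this (a sketch only, assuming crossbeam_channel and the log crate; the 400ms tick is copied from the snippet above):

    use crossbeam_channel::RecvTimeoutError;
    use log::info;
    use std::time::Duration;

    loop {
        match event_receiver.recv_timeout(Duration::from_millis(400)) {
            Ok(_events) => {
                // Process the batch of shred events (impl in the next PR).
            }
            Err(RecvTimeoutError::Disconnected) => {
                info!("Shred event receiver disconnected");
                return;
            }
            // Timeout gives the loop a periodic chance to do housekeeping.
            Err(RecvTimeoutError::Timeout) => (),
        }
    }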

Comment thread ledger/src/shred_event.rs Outdated
Comment on lines +44 to +53
    lower_merkle_root: Hash,
    higher_fec_set_index: u32,
    higher_chained_merkle_root: Hash,
Contributor:

what does lower vs higher mean in this context?

Contributor Author:

A CMR conflict occurs when 2 adjacent FEC sets don't chain together correctly: the lower (by index) FEC set's merkle root is not equal to the higher (by index) FEC set's chained merkle root.
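In code form, the invariant being checked is just an equality between the two roots (a minimal sketch; Hash stands in for the 32-byte merkle root type, and the function is illustrative rather than blockstore API):

    use solana_sdk::hash::Hash;

    /// FEC set N+1 commits to FEC set N by embedding N's merkle root as its
    /// chained merkle root; a ChainedMerkleRootConflict is this check failing.
    fn chains_correctly(lower_merkle_root: &Hash, higher_chained_merkle_root: &Hash) -> bool {
        lower_merkle_root == higher_chained_merkle_root
    }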

Contributor Author:

actually #437 will give us anza-xyz/agave#7728 which enforces fixed 32:32 FEC sets.
Once that lands I can remove higher_fec_set_index as it will just be +32 (and some other places where I specify both index and fec_set_index). Will wait to merge this until then.

Contributor:

ah dope, I was thinking it would be clearer to call it previous/current merkle root instead of lower/higher

Contributor Author:

done, refactored in 5c7c2f5; we only need to track the previous FEC set now

Comment thread ledger/src/blockstore_meta.rs Outdated
Comment thread ledger/src/blockstore_meta.rs Outdated
Comment thread ledger/src/blockstore.rs

if let Some(shred_event_sender) = shred_event_sender {
    for event in shred_events {
        // TODO: handle error
Contributor:

probably need to resolve this on startup right?

Contributor Author:

wdym?

Contributor:

i.e. do we need to resend these events if we shut down, and then start back up again before these events are handled?

Contributor Author (Sep 5, 2025):

it should be fine, essentially repair will have to perform a blockstore scan initially (when receiving a certificate to initiate catchup, or a request to repair an alternate version from replay/safe to notar). The idea behind these events is just so we don't have to rescan to see if we've finished.

So if we restart, we don't need to resend these events.

Comment thread ledger/src/blockstore.rs
Comment thread ledger/src/shred_event.rs
Comment thread ledger/src/shred_event.rs
Comment thread ledger/src/blockstore.rs
Comment on lines +2119 to +2124
.push(ShredEvent::CompletedFECSet {
    location,
    slot,
    fec_set_index,
    is_last_in_slot,
});
Contributor:

This can only be sent once because each shred in an FEC set only passes through check_insert_data_shred() once? i.e. we filter out existing shreds before this function is called

Contributor Author:

exactly, we have a check up above (see PossibleDuplicateShred::Exists) that makes sure we don't already have the shred. So regardless of whether we naturally complete the FEC set or complete it as a result of FEC recovery, we'll only send this event out once.
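A sketch of the guard being described (names approximate the insertion path rather than quoting it; PossibleDuplicateShred::Exists is the variant mentioned above):

    // Reject shreds we already store before any completion accounting runs,
    // so the CompletedFECSet event can fire at most once per FEC set.
    if index_meta.data().contains(u64::from(shred.index())) {
        duplicate_shreds.push(PossibleDuplicateShred::Exists(shred));
        return Err(InsertDataShredError::Exists);
    }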

Comment thread ledger/src/blockstore.rs
Comment on lines +2355 to +2362
shred_events.push(ShredEvent::ChainedMerkleRootConflict {
    location,
    slot,
    lower_fec_set_index: erasure_set.fec_set_index(),
    lower_merkle_root: merkle_root.unwrap_or_default(),
    higher_fec_set_index: next_erasure_set.fec_set_index(),
    higher_chained_merkle_root: chained_merkle_root.unwrap_or_default(),
});

Contributor:

the repair controller will be in charge of dumping these bad shreds?

For future PRs, it might be good to organize them per event handler. Have a feeling it's going to be complicated 🫡

Contributor Author:

Yeah exactly, the MerkleRootConflict / ChainedMerkleRootConflict won't happen during normal operation, only if we're receiving duplicate blocks or getting a malicious repair response.

I wanted to avoid having to rescan blockstore after x ms and dump/re-request repair in this case. Since it's a low-probability event, I figured it's better to have it event driven: the repair controller gets notified when this happens and takes action.
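Organized per event handler as suggested, the controller side might look roughly like this (variant names are from this PR's diff; the handler calls are hypothetical):

    match event {
        ShredEvent::CompletedFECSet { slot, fec_set_index, .. } => {
            // Happy path: clear any outstanding repair requests for this set.
            outstanding_repairs.mark_complete(slot, fec_set_index);
        }
        ShredEvent::ChainedMerkleRootConflict { slot, lower_fec_set_index, .. } => {
            // Low-probability path: dump the conflicting shreds and re-request
            // repair, instead of rescanning blockstore on a timer.
            dump_and_re_repair(slot, lower_fec_set_index);
        }
        // MerkleRootConflict and other variants would be handled similarly.
        _ => (),
    }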

Comment thread ledger/src/blockstore_meta.rs Outdated
pub(crate) fn is_data_set_complete(fec_set_index: u32, index: &Index) -> bool {
    let data_indices =
        u64::from(fec_set_index)..u64::from(fec_set_index) + (DATA_SHREDS_PER_FEC_BLOCK as u64);
    index.data().range(data_indices).count() == DATA_SHREDS_PER_FEC_BLOCK
}
Contributor:

nit: could take the count_ones fastpath here

pub fn count_ones(self) -> usize {
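A sketch of what that fastpath could look like, assuming the data index is backed by u64 words (one bit per shred index) and FEC sets are 32-aligned per the fixed 32:32 layout; the real index type and its accessors may differ:

    const DATA_SHREDS_PER_FEC_BLOCK: u32 = 32;

    pub(crate) fn is_data_set_complete(fec_set_index: u32, words: &[u64]) -> bool {
        let start = fec_set_index as usize;
        // With 32-aligned sets the window sits entirely in one half of a word,
        // so a shift plus popcount replaces iterating a BTree range.
        let window = (words[start / 64] >> (start % 64)) as u32;
        window.count_ones() == DATA_SHREDS_PER_FEC_BLOCK
    }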

Contributor Author:

done 7bc4555

@AshwinSekar (Contributor Author):

We're going a different direction with repair, closing for now
