replay: only vote on blocks with >= 32 data shreds in last fec set #1002
mergify[bot] merged 10 commits into anza-xyz:master
Conversation
Force-pushed from 2c0f620 to 1bdc167
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master    #1002   +/-   ##
========================================
  Coverage    82.1%    82.1%
========================================
  Files         893      893
  Lines      236600   236736    +136
========================================
+ Hits       194429   194574    +145
+ Misses      42171    42162      -9
```
Force-pushed from 7b4ee6c to bc5faca
```rust
.feature_set
.is_active(&solana_sdk::feature_set::vote_only_full_fec_sets::id())
// No reason to check our leader block
&& bank.collector_id() != my_pubkey
```
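The gating logic in the snippet above can be sketched as follows. This is an illustrative sketch only: `Pubkey`, `Bank`, and the boolean feature flag below are simplified stand-ins for the real agave types, not the actual API.

```rust
// Simplified stand-in types (NOT the real solana_sdk / agave definitions).
#[derive(PartialEq, Eq, Clone, Copy)]
struct Pubkey(u64);

struct Bank {
    collector_id: Pubkey,
    // Stand-in for bank.feature_set.is_active(&vote_only_full_fec_sets::id())
    vote_only_full_fec_sets_active: bool,
}

impl Bank {
    fn collector_id(&self) -> &Pubkey {
        &self.collector_id
    }
}

/// Run the last-FEC-set check only when the feature gate is active and the
/// block was not produced by this node (no reason to check our own leader
/// block).
fn should_check_last_fec_set(bank: &Bank, my_pubkey: &Pubkey) -> bool {
    bank.vote_only_full_fec_sets_active && bank.collector_id() != my_pubkey
}
```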
set-identity sanity check:
This should be fine, as we do not update my_pubkey between maybe_start_leader and replay_active_banks. In fact, we already have a my_pubkey comparison happening in replay_active_bank:
agave/core/src/replay_stage.rs, line 2926 in 12d009e
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
```rust
if bank.collector_id() != my_pubkey {
    // If the block does not have at least DATA_SHREDS_PER_FEC_BLOCK shreds in the last FEC set,
    // process it like a duplicate, which allows us to continue replaying the fork but not vote on it.
```
Do we want to continue replaying this fork if it's clearly the wrong fork/doesn't meet the protocol requirement of 32 data shreds? Seems like it would be better to mark it as dead and let the duplicate logic handle getting another version if it's available.
I see no benefit to marking it as dead instead of duplicate. Both cases will be handled by both the current and future duplicate confirmed resolution mechanisms.
Keeping it as a duplicate is preferable so we can continue replaying votes on the fork.
Talked more offline - will update this to mark the block as dead.
This is more in line with similar failures (PoH failures), and we should not waste resources replaying any descendants.
There is also no "duplicate proof" we can send in this scenario, so dead makes more sense.
For future reference, maybe add the reasoning as a comment in the code explaining why the block is marked as "dead" and not "duplicate".
Also, isn't there a concern that the block is rooted by the cluster anyway, so my node has to oblige? (Which I am guessing is not possible if the block is marked as "dead".)
For example, the Firedancer client might not implement this spec precisely, or we might hit an edge case (or bug) in blockstore where the metadata to confirm >= 32 data shreds isn't populated, so the check fails only because the blockstore queries fail.
Both "dead" and "duplicate" blocks are resolved similarly if > 52% of the cluster votes on them. The differences are that:
- Dead blocks always require a dump & repair cycle in order to continue.
- Duplicate blocks can be continued (marked un-duplicate) without dump & repair if we have the bank hash that matches what the cluster has voted on.
EDIT: If after a dump & repair we still have < 32 data shreds, we will mark the block as dead again. I think this is okay because > 52% of the cluster must be faulty to vote on this block.
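The dead-vs-duplicate distinction described above can be sketched as follows. This is a hypothetical illustration: the enum and function names are mine, not the actual agave implementation.

```rust
// Hypothetical sketch of the resolution rules discussed above.
#[derive(PartialEq, Eq, Debug, Clone, Copy)]
enum MarkedBlock {
    /// e.g. failed the >= 32 data shred check, or a PoH verification failure
    Dead,
    /// a conflicting version of the block exists
    Duplicate,
}

/// Dead blocks always require a dump & repair cycle before replay can
/// continue. A duplicate block can be continued (marked un-duplicate)
/// without dump & repair if our bank hash already matches the hash the
/// cluster voted on.
fn requires_dump_and_repair(status: MarkedBlock, our_hash_matches_cluster: bool) -> bool {
    match status {
        MarkedBlock::Dead => true,
        MarkedBlock::Duplicate => !our_hash_matches_cluster,
    }
}
```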
Force-pushed from 43a241f to b6021cb
Force-pushed from 75d7ada to ca90e17
…1002)

* replay: only vote on blocks with >= 32 data shreds in last fec set
* pr feedback: pub(crate), inspect_err
* pr feedback: error variants, collapse function, dedup
* pr feedback: remove set_last_in_slot, rework test
* pr feedback: add metric, perform check regardless of ff
* pr feedback: mark block as dead rather than duplicate
* pr feedback: self.meta, const_assert, no collect
* pr feedback: cfg(test) assertion, remove expect and collect, error fmt
* Keep the collect to preserve error
* pr feedback: do not hold bank_forks lock for mark_dead_slot

(cherry picked from commit 8c67696)

Conflicts:
- core/src/replay_stage.rs
- ledger/src/blockstore_processor.rs
- sdk/src/feature_set.rs
Continued from solana-labs#35024
Problem
In order to ensure that the last erasure batch was sufficiently propagated through turbine, we verify that 32+ shreds are received from turbine or repair.
Summary of Changes
#639 pads the last erasure batch with empty data shreds so that it contains at least 32 data shreds.
Once a block has finished replay, we can check whether the last FEC set is full by checking for >= 32 data shreds with the same merkle root. This implies that at least 32 data or coding shreds were received through turbine or repair.
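The completeness check described above can be sketched as follows. This is a minimal sketch under the assumption that we can enumerate the merkle root of each data shred in the last FEC set; `last_fec_set_is_full` and its input are illustrative, not the actual blockstore API.

```rust
// Protocol minimum from the PR: the last FEC set must contain at least this
// many data shreds for the block to be votable.
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// True if at least DATA_SHREDS_PER_FEC_BLOCK data shreds carry the same
/// merkle root as the final ("last in slot") data shred. (Hypothetical
/// helper; the real check queries blockstore metadata.)
fn last_fec_set_is_full(data_shred_merkle_roots: &[[u8; 32]]) -> bool {
    let Some(last_root) = data_shred_merkle_roots.last() else {
        // No shreds at all: the set cannot be full.
        return false;
    };
    data_shred_merkle_roots
        .iter()
        .filter(|root| *root == last_root)
        .count()
        >= DATA_SHREDS_PER_FEC_BLOCK
}
```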