replay: only vote on blocks with >= 32 data shreds in last fec set #1002

Merged
mergify[bot] merged 10 commits into anza-xyz:master from AshwinSekar:full-fec-set
May 17, 2024
Conversation

@AshwinSekar

Continued from solana-labs#35024

Problem

In order to ensure that the last erasure batch was sufficiently propagated through turbine, we verify that 32+ shreds are received from turbine or repair.

Summary of Changes

#639 pads the last erasure batch with empty data shreds so that it contains at least 32 data shreds.
Once a block has finished replay, we can check if the last FEC set is full by seeing if there are >= 32 data shreds with the same merkle root. This implies that at least 32 data or coding shreds were received through turbine or repair.
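The check described above can be sketched as follows. This is a minimal illustration, not the actual agave blockstore API: the function name and types are assumptions.

```rust
// Minimal sketch of the full-FEC-set check described above; the name
// and signature are illustrative, not the actual agave blockstore API.
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// The last FEC set is considered full when at least 32 of its data
/// shreds carry the expected merkle root for the set.
fn last_fec_set_is_full(data_shred_merkle_roots: &[[u8; 32]], expected_root: &[u8; 32]) -> bool {
    data_shred_merkle_roots
        .iter()
        .filter(|root| *root == expected_root)
        .count()
        >= DATA_SHREDS_PER_FEC_BLOCK
}
```

Counting only shreds whose merkle root matches ensures that shreds from a different (duplicate) version of the last erasure batch do not satisfy the check.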

@AshwinSekar AshwinSekar force-pushed the full-fec-set branch 2 times, most recently from 2c0f620 to 1bdc167 on April 24, 2024 20:55
@codecov-commenter

codecov-commenter commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 83.82353% with 22 lines in your changes missing coverage. Please review.

Project coverage is 82.1%. Comparing base (2bc026d) to head (ca90e17).
Report is 10 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #1002    +/-   ##
========================================
  Coverage    82.1%    82.1%            
========================================
  Files         893      893            
  Lines      236600   236736   +136     
========================================
+ Hits       194429   194574   +145     
+ Misses      42171    42162     -9     

@AshwinSekar AshwinSekar force-pushed the full-fec-set branch 5 times, most recently from 7b4ee6c to bc5faca on May 2, 2024 16:09
Comment thread core/src/replay_stage.rs Outdated
.feature_set
.is_active(&solana_sdk::feature_set::vote_only_full_fec_sets::id())
// No reason to check our leader block
&& bank.collector_id() != my_pubkey
Author

set-identity sanity check:
This should be fine, as we do not update my_pubkey between maybe_start_leader and replay_active_banks. In fact, we already have a my_pubkey comparison in replay_active_bank:

if bank.collector_id() != my_pubkey {
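The gating discussed above can be sketched as a standalone condition. This is a hypothetical illustration, not the actual replay_stage code: the check runs only when the feature gate is active and the block was not produced by this node.

```rust
// Hypothetical sketch of the gating condition: run the last-FEC-set
// check only when the feature gate is active and the block's leader
// (collector) is not this node.
fn should_check_last_fec_set(feature_active: bool, collector_id: &str, my_pubkey: &str) -> bool {
    feature_active && collector_id != my_pubkey
}
```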

@AshwinSekar AshwinSekar marked this pull request as ready for review May 2, 2024 16:14
@AshwinSekar AshwinSekar requested review from behzadnouri and carllin May 2, 2024 16:16
@mergify
Copy link
Copy Markdown

mergify Bot commented May 2, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Comment thread core/src/repair/cluster_slot_state_verifier.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread core/src/replay_stage.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/shred/shred_data.rs
Comment thread core/src/replay_stage.rs Outdated

if bank.collector_id() != my_pubkey {
// If the block does not have at least DATA_SHREDS_PER_FEC_BLOCK shreds in the last FEC set,
// process it like a duplicate, which allows us to continue replaying the fork but not vote on it.

Do we want to continue replaying this fork if it's clearly the wrong fork/doesn't meet the protocol requirement of 32 data shreds? Seems like it would be better to mark it as dead and let the duplicate logic handle getting another version if it's available.

Author

I see no benefit to marking it as dead instead of duplicate. Both cases will be handled by both the current and future duplicate confirmed resolution mechanisms.

Keeping it as a duplicate is preferable so we can continue replaying votes on the fork.

Author

talked more offline - will update this to mark as dead.
This is more in line with similar failures (PoH failures), and we should not waste resources replaying any descendants.
There is also no "duplicate proof" we can send in this scenario, so dead makes more sense.
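The outcome settled on above can be summarized in a minimal sketch. The names here are illustrative, not the actual agave types: a block whose last FEC set has fewer than 32 data shreds is marked dead, so its descendants are not replayed.

```rust
// Illustrative sketch, not the actual agave implementation.
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

#[derive(Debug, PartialEq)]
enum BlockOutcome {
    Votable,
    // Marked dead: descendants are not replayed, and recovery
    // requires a dump & repair cycle.
    Dead,
}

fn outcome_after_replay(last_fec_set_data_shreds: usize) -> BlockOutcome {
    if last_fec_set_data_shreds >= DATA_SHREDS_PER_FEC_BLOCK {
        BlockOutcome::Votable
    } else {
        BlockOutcome::Dead
    }
}
```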


For future reference, maybe add the reasoning as a comment in the code why the block is marked as "dead" and not "duplicate"


Also, isn't there a concern that the block is rooted by the cluster anyway, and so my node has to oblige? (Which, I am guessing, is not possible if the block is marked as "dead".)
For example, if the firedancer client does not implement this spec precisely, or we hit an edge case (or bug) in blockstore where the metadata to confirm >= 32 data shreds isn't populated, so the check fails only because the blockstore queries fail?

Author
@AshwinSekar AshwinSekar May 9, 2024

Both "dead" and "duplicate" blocks are resolved similarly if >52% of the cluster votes on it. The differences are that:

  • Dead blocks always require a dump & repair cycle in order to continue
  • Duplicate blocks can be continued (marked un-duplicate) without dump & repair if we have the bank hash that matches what the cluster has voted on.

EDIT: If after a dump & repair we still have < 32 data shreds, we will mark it as dead again. I think this is okay, because > 52% of the cluster must be faulty to have voted on such a block.
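The resolution rule described above can be sketched as a stake-threshold check. The 52% figure is taken from this discussion, and the names are illustrative, not the actual agave code.

```rust
// Illustrative sketch of the stake-threshold check discussed above:
// a dead or duplicate block is resolved (revived) only once more than
// 52% of total stake has voted on it.
const DUPLICATE_THRESHOLD: f64 = 0.52;

fn is_duplicate_confirmed(voted_stake: u64, total_stake: u64) -> bool {
    total_stake > 0 && (voted_stake as f64) > DUPLICATE_THRESHOLD * (total_stake as f64)
}
```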

@AshwinSekar AshwinSekar force-pushed the full-fec-set branch 3 times, most recently from 43a241f to b6021cb on May 8, 2024 21:04
Comment thread core/src/replay_stage.rs
Comment thread core/src/replay_stage.rs Outdated

Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/blockstore.rs Outdated
Comment thread core/src/replay_stage.rs
Comment thread ledger/src/blockstore.rs Outdated
@AshwinSekar AshwinSekar force-pushed the full-fec-set branch 2 times, most recently from 75d7ada to ca90e17 on May 9, 2024 16:42
Comment thread core/src/replay_stage.rs Outdated
Comment thread ledger/src/blockstore_db.rs Outdated
Comment thread ledger/src/blockstore.rs
Comment thread ledger/src/blockstore.rs Outdated
Comment thread ledger/src/blockstore.rs
behzadnouri
behzadnouri previously approved these changes May 16, 2024
@AshwinSekar AshwinSekar added the automerge automerge Merge this Pull Request automatically once CI passes label May 17, 2024
@mergify mergify Bot merged commit 8c67696 into anza-xyz:master May 17, 2024
mergify Bot pushed a commit that referenced this pull request May 17, 2024
…1002)

* replay: only vote on blocks with >= 32 data shreds in last fec set

* pr feedback: pub(crate), inspect_err

* pr feedback: error variants, collapse function, dedup

* pr feedback: remove set_last_in_slot, rework test

* pr feedback: add metric, perform check regardless of ff

* pr feedback: mark block as dead rather than duplicate

* pr feedback: self.meta, const_assert, no collect

* pr feedback: cfg(test) assertion, remove expect and collect, error fmt

* Keep the collect to preserve error

* pr feedback: do not hold bank_forks lock for mark_dead_slot

(cherry picked from commit 8c67696)

# Conflicts:
#	core/src/replay_stage.rs
#	ledger/src/blockstore_processor.rs
#	sdk/src/feature_set.rs
@AshwinSekar AshwinSekar deleted the full-fec-set branch May 17, 2024 19:10
AshwinSekar added a commit that referenced this pull request Jun 4, 2024
…1002)

* replay: only vote on blocks with >= 32 data shreds in last fec set

* pr feedback: pub(crate), inspect_err

* pr feedback: error variants, collapse function, dedup

* pr feedback: remove set_last_in_slot, rework test

* pr feedback: add metric, perform check regardless of ff

* pr feedback: mark block as dead rather than duplicate

* pr feedback: self.meta, const_assert, no collect

* pr feedback: cfg(test) assertion, remove expect and collect, error fmt

* Keep the collect to preserve error

* pr feedback: do not hold bank_forks lock for mark_dead_slot

(cherry picked from commit 8c67696)

# Conflicts:
#	core/src/replay_stage.rs
#	ledger/src/blockstore_processor.rs
#	sdk/src/feature_set.rs
4 participants