replay: only vote on blocks with >= 32 data shreds in last fec set #1002
mergify[bot] merged 10 commits into anza-xyz:master
Conversation
Force-pushed from 2c0f620 to 1bdc167
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master    #1002   +/-   ##
========================================
  Coverage    82.1%    82.1%
========================================
  Files         893      893
  Lines      236600   236736    +136
========================================
+ Hits       194429   194574    +145
+ Misses      42171    42162      -9
```
Force-pushed from 7b4ee6c to bc5faca
```rust
.feature_set
.is_active(&solana_sdk::feature_set::vote_only_full_fec_sets::id())
// No reason to check our leader block
&& bank.collector_id() != my_pubkey
```
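The gating logic in the snippet above can be sketched as follows. This is an illustrative sketch only: `Pubkey`, `Bank`, and the boolean feature flag below are simplified stand-ins for the real agave types, not the actual API.

```rust
// Simplified stand-in types (NOT the real solana_sdk / agave definitions).
#[derive(PartialEq, Eq, Clone, Copy)]
struct Pubkey(u64);

struct Bank {
    collector_id: Pubkey,
    // Stand-in for bank.feature_set.is_active(&vote_only_full_fec_sets::id())
    vote_only_full_fec_sets_active: bool,
}

impl Bank {
    fn collector_id(&self) -> &Pubkey {
        &self.collector_id
    }
}

/// Run the last-FEC-set check only when the feature gate is active and the
/// block was not produced by this node (no reason to check our own leader
/// block).
fn should_check_last_fec_set(bank: &Bank, my_pubkey: &Pubkey) -> bool {
    bank.vote_only_full_fec_sets_active && bank.collector_id() != my_pubkey
}
```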
set-identity sanity check:
This should be fine, as we do not update my_pubkey between maybe_start_leader and replay_active_banks. In fact, we already have a my_pubkey comparison happening in replay_active_bank:
agave/core/src/replay_stage.rs, line 2926 in 12d009e
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
```rust
if bank.collector_id() != my_pubkey {
    // If the block does not have at least DATA_SHREDS_PER_FEC_BLOCK shreds in the last FEC set,
    // process it like a duplicate, which allows us to continue replaying the fork but not vote on it.
```
Do we want to continue replaying this fork if it's clearly the wrong fork/doesn't meet the protocol requirement of 32 data shreds? Seems like it would be better to mark it as dead and let the duplicate logic handle getting another version if it's available.
I see no benefit to marking it as dead instead of duplicate. Both cases will be handled by both the current and future duplicate confirmed resolution mechanisms.
Keeping it as a duplicate is preferable so we can continue replaying votes on the fork.
Talked more offline - will update this to mark the block as dead.
This is more in line with similar failures (PoH failures), and we should not waste resources replaying any descendants.
There is also no "duplicate proof" we can send in this scenario, so dead makes more sense.
For future reference, maybe add the reasoning as a comment in the code explaining why the block is marked as "dead" and not "duplicate".
Also, isn't there a concern that the block is rooted by the cluster anyway, so my node has to oblige? (Which I am guessing is not possible if the block is marked as "dead".)
For example, the Firedancer client might not implement this spec precisely, or we might hit an edge case (or bug) in blockstore where the metadata to confirm >= 32 data shreds isn't populated, so the check fails only because the blockstore queries fail.
Both "dead" and "duplicate" blocks are resolved similarly if > 52% of the cluster votes on them. The differences are that:
- Dead blocks always require a dump & repair cycle in order to continue.
- Duplicate blocks can be continued (marked un-duplicate) without dump & repair if we have the bank hash that matches what the cluster has voted on.
EDIT: If after a dump & repair we still have < 32 data shreds, we will mark the block as dead again. I think this is okay because > 52% of the cluster must be faulty to vote on this block.
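The dead-vs-duplicate distinction described above can be sketched as follows. This is a hypothetical illustration: the enum and function names are mine, not the actual agave implementation.

```rust
// Hypothetical sketch of the resolution rules discussed above.
#[derive(PartialEq, Eq, Debug, Clone, Copy)]
enum MarkedBlock {
    /// e.g. failed the >= 32 data shred check, or a PoH verification failure
    Dead,
    /// a conflicting version of the block exists
    Duplicate,
}

/// Dead blocks always require a dump & repair cycle before replay can
/// continue. A duplicate block can be continued (marked un-duplicate)
/// without dump & repair if our bank hash already matches the hash the
/// cluster voted on.
fn requires_dump_and_repair(status: MarkedBlock, our_hash_matches_cluster: bool) -> bool {
    match status {
        MarkedBlock::Dead => true,
        MarkedBlock::Duplicate => !our_hash_matches_cluster,
    }
}
```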
Force-pushed from 43a241f to b6021cb
Force-pushed from 75d7ada to ca90e17
…1002)

* replay: only vote on blocks with >= 32 data shreds in last fec set
* pr feedback: pub(crate), inspect_err
* pr feedback: error variants, collapse function, dedup
* pr feedback: remove set_last_in_slot, rework test
* pr feedback: add metric, perform check regardless of ff
* pr feedback: mark block as dead rather than duplicate
* pr feedback: self.meta, const_assert, no collect
* pr feedback: cfg(test) assertion, remove expect and collect, error fmt
* Keep the collect to preserve error
* pr feedback: do not hold bank_forks lock for mark_dead_slot

(cherry picked from commit 8c67696)

Conflicts:
- core/src/replay_stage.rs
- ledger/src/blockstore_processor.rs
- sdk/src/feature_set.rs
Continued from solana-labs#35024
Problem
In order to ensure that the last erasure batch was sufficiently propagated through turbine, we verify that 32+ shreds are received from turbine or repair.
Summary of Changes
#639 pads the last erasure batch with empty data shreds so that it contains at least 32 data shreds.
Once a block has finished replay, we can check whether the last FEC set is full by checking for >= 32 data shreds with the same merkle root. This implies that at least 32 data or coding shreds were received through turbine or repair.
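The completeness check described above can be sketched as follows. This is a minimal sketch under the assumption that we can enumerate the merkle root of each data shred in the last FEC set; `last_fec_set_is_full` and its input are illustrative, not the actual blockstore API.

```rust
// Protocol minimum from the PR: the last FEC set must contain at least this
// many data shreds for the block to be votable.
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// True if at least DATA_SHREDS_PER_FEC_BLOCK data shreds carry the same
/// merkle root as the final ("last in slot") data shred. (Hypothetical
/// helper; the real check queries blockstore metadata.)
fn last_fec_set_is_full(data_shred_merkle_roots: &[[u8; 32]]) -> bool {
    let Some(last_root) = data_shred_merkle_roots.last() else {
        // No shreds at all: the set cannot be full.
        return false;
    };
    data_shred_merkle_roots
        .iter()
        .filter(|root| *root == last_root)
        .count()
        >= DATA_SHREDS_PER_FEC_BLOCK
}
```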