FCU Canonical chain corruption after beacon sync (3 fixes + startup healing)#10876
svlachakis merged 44 commits into master
Conversation
@claude review

Claude finished @smartprogrammer93's task in 3m 53s — Claude's Review

Summary: The fix is correct and minimal. Removing the 5-line early-return restores the semantics demanded by the Engine API spec: an FCU to any already-known block must always update chain state, not just return early.
Pull request overview
Fixes a forkchoice update edge case where FCU targeting an already-canonical ancestor below the current head was previously short-circuited, leaving the higher descendants incorrectly marked canonical and the head pointer unchanged. This aligns ForkchoiceUpdatedHandler.ApplyForkchoiceUpdate behavior with the intended canonical reorg semantics and adds regression/unit coverage around canonical marking.
Changes:
- Remove the early-return optimization in `ForkchoiceUpdatedHandler` so `UpdateMainChain` runs even when FCU targets a canonical ancestor below head.
- Add an Engine API regression test to ensure FCU to an ancestor moves head back and de-canonicalizes descendants.
- Add a suite of BlockTree unit tests covering canonical marking/unmarking across reorg and lookup scenarios (including persistence/reload).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Nethermind/Nethermind.Merge.Plugin/Handlers/ForkchoiceUpdatedHandler.cs | Removes the processed-branch early return so FCU to a canonical ancestor triggers main-chain update/de-canonicalization. |
| src/Nethermind/Nethermind.Merge.Plugin.Test/EngineModuleTests.V1.cs | Adds an Engine API regression test reproducing and preventing the “ancestor FCU ignored” canonical mismatch. |
| src/Nethermind/Nethermind.Blockchain.Test/BlockTreeTests.cs | Adds canonical-marking/unit tests for reorg, lookup consistency, and persistence/reload edge cases. |
@claude review

Claude finished @smartprogrammer93's task in 5m 56s — Claude's Review

Context: This is a different implementation from what the previous Claude review described; the current branch does not touch … Summary: The fix is correct and well-targeted.

BTW, in ePBS we will get an FCU for a block and can then get an FCU to the ancestor (not sure if it is always the parent), and we need to support that. How does this affect this scenario? Is it compatible with this during sync?
Add check_canonical.py that walks the chain backwards via parentHash and compares against eth_getBlockByNumber results to detect stale canonical markers after reorgs -- the exact bug described in NethermindEth/nethermind#10876. Hook the check into run.sh after every setup/*.txt replay (which contains the reorg separator block). Controlled by CANONICAL_CHECK=true env var, enabled by default in the CI workflow. For this temporary testing branch: - Remove PostgreSQL DB push step from repricing-client.yml - Remove artifact zip/upload steps from repricing-client.yml - Remove retrigger job from repricing-nethermind.yml
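The walk-and-compare at the heart of such a check can be sketched in a few lines of Python. This is an illustrative model, not the actual check_canonical.py: the function name and the dict-based stand-ins for the RPC responses (`parent_of`, `number_of`, `canonical_by_number`) are hypothetical, with the by-number mapping playing the role of eth_getBlockByNumber results.

```python
# Illustrative core of a check_canonical.py-style tool: walk backwards from
# head via parent links and flag every height where the by-number lookup
# reports a different hash, i.e. a stale canonical marker left by a reorg.

def find_stale_heights(head_hash, parent_of, number_of, canonical_by_number):
    """Return heights where the walked chain and the by-number index disagree.

    head_hash           -- hash of the current head block
    parent_of           -- dict: block hash -> parent hash (None at genesis)
    number_of           -- dict: block hash -> block number
    canonical_by_number -- dict: number -> hash, as eth_getBlockByNumber sees it
    """
    mismatches = []
    h = head_hash
    while h is not None:
        n = number_of[h]
        if canonical_by_number.get(n) != h:
            mismatches.append(n)  # stale or missing canonical marker
        h = parent_of.get(h)
    return sorted(mismatches)

# Chain: g <- a1 <- a2 (head), but after a reorg the by-number index still
# points at an orphaned sibling b2 for height 2.
parent_of = {"g": None, "a1": "g", "a2": "a1", "b2": "a1"}
number_of = {"g": 0, "a1": 1, "a2": 2, "b2": 2}
canonical = {0: "g", 1: "a1", 2: "b2"}  # height 2 is stale
print(find_stale_heights("a2", parent_of, number_of, canonical))  # -> [2]
```

The parent-linked walk from head is authoritative by construction, so any height where the by-number lookup disagrees exposes exactly the stale-marker bug the check is hunting for.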
…c09ad9d0866da67403ffa99d11ba70ec3/core/rawdb/accessors_chain.go#L47 return null for orphaned heights post-merge in GetBlockHashOnMainOrBestDifficultyHash
* small refactors in BlockTree
* de-duplicate tests
This reverts commit b7a54d1.
src/Nethermind/Nethermind.Consensus/Stateless/StatelessBlockTree.cs
…ealing) (#10876)

* FCU to canonical ancestor silently ignored, leaving descendants canonical
* fix comments
* fix comment
* change approach
* remove empty line
* HealCanonicalChain implementation & tests
* fix build
* ePBS FCU fix
* geth parity with https://github.com/ethereum/go-ethereum/blob/745b0a8c09ad9d0866da67403ffa99d11ba70ec3/core/rawdb/accessors_chain.go#L47 — return null for orphaned heights post-merge in GetBlockHashOnMainOrBestDifficultyHash
* GetBlockHashOnMainOrBestDifficultyHash now returns null when HasBlockOnMainChain=true but WasProcessed=false and blockNumber > Head.Number. Beacon sync calls UpdateMainChain(wereProcessed=false), setting HasBlockOnMainChain without advancing Head. If this races with a cleanup FCU where previousHeadNumber == lastNumber, the upward scan runs before the marker is set and cannot clear it — leaving a stale marker that eth_getBlockByNumber would return as canonical. The write-time scan cannot close this window; a read-time guard can. WasProcessed=false precisely identifies beacon-sync markers: processed canonical blocks always have WasProcessed=true, so startup/reload paths are unaffected.
* revert change for WasProcessed=false
* test: add beacon sync + reorg stale marker reproduction test. Reproduction of the stale canonical markers bug from the Engine API test generator: beacon sync marks H+1, H+2, H+3 canonical without advancing Head, then FCU reorgs to a sibling at the same height as Head. Verifies all orphaned levels are de-canonicalized.
* test: add failing gap test for beacon sync race condition. Adds UpdateMainChain_beacon_sync_gap_in_stale_markers_leaves_orphan_after_reorg, which reproduces the scenario where a concurrent MoveToMain creates a gap in stale canonical markers during beacon sync. The break-on-first-gap upward scan stops at the gap and leaves d3 orphaned as stale canonical. This test FAILS on canonical-fix (expected) and PASSES on bounded-scan.
* fix: skip gaps in upward scan instead of breaking. Change Phase 2 upward scan and ClearStaleMarkersAbove to continue past levels where HasBlockOnMainChain is false, breaking only when the level does not exist. This handles gaps left by concurrent MoveToMain without needing a BestKnownNumber bound.
* PR cleanup
* remove misleading comment
* minor copilot comments
* review comments
* Canonical fix refactor (#10972): small refactors in BlockTree; de-duplicate tests; Lukasz review; fixes; Revert "fixes" (reverts commit b7a54d1); Lukasz review - Refactoring; fixes; revert blocktree registration

Co-authored-by: Kamil Chodoła <43241881+kamilchodola@users.noreply.github.com>
Co-authored-by: Kamil Chodoła <kamil.chodola@gmail.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>
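The "skip gaps, break only on a missing level" behaviour from the last fix commit can be modelled in a short Python sketch. The dict of levels is a hypothetical stand-in for BlockTree's ChainLevelInfo store; all names are illustrative.

```python
# Minimal model of the Phase 2 upward scan: clear stale canonical markers
# above `start`, passing over levels that are already non-canonical (gaps
# left by a concurrent MoveToMain) and stopping only when a level is missing.
def clear_stale_markers_above(levels, start):
    """levels: dict number -> {'has_block_on_main_chain': bool}; mutated in place."""
    n = start + 1
    while n in levels:          # break only when the level does not exist
        # a gap (flag already False) is simply overwritten, never a stop signal
        levels[n]['has_block_on_main_chain'] = False
        n += 1

# Beacon sync marked 11 and 13 canonical; a concurrent MoveToMain already
# cleared 12, leaving a gap. A break-on-first-gap scan would strand 13.
levels = {11: {'has_block_on_main_chain': True},
          12: {'has_block_on_main_chain': False},
          13: {'has_block_on_main_chain': True}}
clear_stale_markers_above(levels, 10)
print(sorted(n for n, l in levels.items() if l['has_block_on_main_chain']))  # -> []
```

Because the loop terminates on a missing level rather than on a cleared flag, no BestKnownNumber bound is needed to get past gaps.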
fix: guard ClearStaleMarkersAbove behind forceUpdateHeadBlock to avoid hot-path scan

PR #10876 added an unconditional upward scan in UpdateMainChain that clears stale canonical markers left by beacon sync. The scan runs on every call and walks until LoadLevel returns null, which is O(K) per block during forward sync (BlockDownloader) and O(K²) total when BlockchainProcessor falls behind — also incorrectly clearing valid beacon-synced markers ahead of the processing front. The scan is only needed during FCU reorgs, which is the only path that passes forceUpdateHeadBlock: true. Guard the call so forward sync and forward block processing skip the scan entirely.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
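A minimal model of the guard described in this commit, with hypothetical Python names standing in for the C# originals: the scan only runs on the FCU/reorg path, so the forward-sync hot path never pays the walk.

```python
# Sketch: the stale-marker scan is reachable only when the caller passes
# force_update_head_block=True (the FCU/reorg path); forward sync passes
# False and skips the O(K) walk entirely. Names are illustrative.
def update_main_chain(levels, last_number, force_update_head_block):
    """Returns how many levels the upward scan visited."""
    visited = 0
    if force_update_head_block:        # the guard added by this fix
        n = last_number + 1
        while n in levels:             # O(K) walk, now reorg-only
            levels[n]['has_block_on_main_chain'] = False
            visited += 1
            n += 1
    return visited

sync_levels = {n: {'has_block_on_main_chain': True} for n in range(101, 106)}
print(update_main_chain(sync_levels, 100, False))  # forward sync: -> 0
print(update_main_chain(sync_levels, 100, True))   # FCU reorg:   -> 5
```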
Resolves #9906 & #10861
Changes
Fix 1: Stale canonical markers after beacon sync
`BlockDownloader` calls `UpdateMainChain(wereProcessed: false)` to mark synced blocks canonical without updating Head. This creates three cases where stale markers survive:

- A cleanup FCU where `previousHeadNumber == lastNumber` — the unmark loop condition `previousHeadNumber > lastNumber` is false, so it is skipped entirely.
- The downward unmark stops at `lastNumber` but cannot see beacon-synced markers above the stale `Head`.
- A concurrent `MoveToMain` clears an intermediate level mid-scan, creating a gap — the old break on the first non-canonical level stopped before clearing markers above it.

In all three cases orphaned blocks retain `HasBlockOnMainChain = true`, causing `eth_getBlockByNumber` to return the wrong block.

Fix: Added an unconditional upward scan after the downward unmark, starting from `Math.Max(previousHeadNumber, lastNumber) + 1`, that skips gaps rather than breaking on the first missing marker — covering all three cases in one loop.

Fix 2: PoW best-difficulty fallback returns orphaned blocks in PoS
`GetBlockHashOnMainOrBestDifficultyHash` fell back to the highest-TD block when `HasBlockOnMainChain = false`. In PoS all blocks share `TD = 0`, so the fallback returned whichever block it found first — after a reorg, the orphaned one.

Fix: Post-merge (`Head.TotalDifficulty ≥ TTD`), return `null` immediately with no fallback — matching geth's `ReadCanonicalHash` semantics.

Fix 3: Startup repair for nodes already affected
Nodes that hit this bug before the fix may have corrupted markers on disk. A new config flag triggers a one-time repair at boot: `BlockTree.HealCanonicalChain` runs two phases in a single atomic `BatchWrite`:

- Phase 1: clears `HasBlockOnMainChain = true` levels above head left by the sync path.
- Phase 2: walks `ParentHash` from head for up to `HealCanonicalChainDepth` blocks, ensuring every ancestor has `HasBlockOnMainChain = true` at index 0 in its `ChainLevelInfo` slot.

Tests
- `UpdateMainChain_WhenBeaconSyncMarksThenReorgsToSibling_DecanonalizesDescendant` — regression for Fix 1. Fails before, passes after.
- `UpdateMainChain_WhenFcuToAncestorWithStaleBeaconSyncedDescendants_ClearsAll` — ePBS scenario. Fails before the `Math.Max` change, passes after.
- `UpdateMainChain_WhenGapInBeaconSyncMarkersAndReorging_ClearsStaleMarkersAcrossGap` — gap race scenario. Fails with break-on-gap, passes with skip-gaps.
- `FindBlock_WhenBlockOrphanedAfterReorgInPoS_ReturnsNull` — regression for Fix 2. Fails before the post-merge guard, passes after.

Types of changes
What types of changes does your code introduce?
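The two-phase repair described under Fix 3 above can be sketched as a small Python model. This is a sketch under assumed data shapes, not the BlockTree.HealCanonicalChain implementation: `levels` keeps the canonical block at index 0, mirroring the ChainLevelInfo slot, and all names are hypothetical.

```python
# Sketch of the two-phase startup repair. The real code runs both phases in
# one atomic BatchWrite; here plain dicts model the block tree. Assumed
# shapes: levels maps height -> list of hashes (index 0 = canonical slot),
# blocks maps hash -> {'number', 'parent', 'canonical'}.
def heal_canonical_chain(levels, blocks, head_hash, depth):
    head_number = blocks[head_hash]['number']
    # Phase 1: clear canonical flags above head left behind by beacon sync.
    for n, hashes in levels.items():
        if n > head_number:
            for h in hashes:
                blocks[h]['canonical'] = False
    # Phase 2: walk ParentHash from head for up to `depth` blocks, re-marking
    # each ancestor canonical and moving it to index 0 of its level.
    h = head_hash
    for _ in range(depth):
        if h is None:
            break
        blocks[h]['canonical'] = True
        level = levels[blocks[h]['number']]
        level.insert(0, level.pop(level.index(h)))
        h = blocks[h]['parent']

# Head is a1; c2 is a stale beacon-synced marker above head; a1 lost its flag.
levels = {0: ['g'], 1: ['b1', 'a1'], 2: ['c2']}
blocks = {'g':  {'number': 0, 'parent': None, 'canonical': True},
          'b1': {'number': 1, 'parent': 'g',  'canonical': False},
          'a1': {'number': 1, 'parent': 'g',  'canonical': False},
          'c2': {'number': 2, 'parent': 'b1', 'canonical': True}}
heal_canonical_chain(levels, blocks, 'a1', depth=16)
print(sorted(h for h, b in blocks.items() if b['canonical']))  # -> ['a1', 'g']
```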
Testing
Requires testing
If yes, did you write tests?
Testing Notes
Tested Healing functionality on our Gnosis Archive node and it works as expected.
Added regression test on our test suite NethermindEth/nethermind-node-tests#36
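For reference, the post-merge guard from Fix 2 can be modelled as follows. This is an illustrative sketch, not the Nethermind code; the field names and signature are assumptions.

```python
# Sketch of the Fix 2 behaviour in GetBlockHashOnMainOrBestDifficultyHash:
# pre-merge, a missing canonical marker falls back to the highest-total-
# difficulty block; post-merge (Head.TotalDifficulty >= TTD) the lookup
# returns None instead, matching geth's ReadCanonicalHash for orphaned heights.
def block_hash_on_main_or_best_difficulty(level, head_td, ttd):
    for b in level:
        if b['has_block_on_main_chain']:
            return b['hash']             # canonical marker wins when present
    if head_td >= ttd:
        return None                      # post-merge: no fallback for orphans
    # pre-merge PoW: highest total difficulty is a meaningful tie-breaker;
    # in PoS all blocks share the same TD, which is why this picks arbitrarily
    return max(level, key=lambda b: b['total_difficulty'])['hash'] if level else None

orphans = [{'hash': '0xaa', 'has_block_on_main_chain': False, 'total_difficulty': 0},
           {'hash': '0xbb', 'has_block_on_main_chain': False, 'total_difficulty': 0}]
print(block_hash_on_main_or_best_difficulty(orphans, head_td=10**22, ttd=10**22))  # -> None
```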